SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY






Local approximate inference algorithms 

Kyomin Jung and Devavrat Shah 






Abstract — We present a new local approximation algorithm 
for computing Maximum a Posteriori (MAP) and log-partition 
function for arbitrary exponential family distribution represented 
by a finite-valued pair-wise Markov random field (MRF), say G. 
Our algorithm is based on decomposition of G into appropriately 
chosen small components; then computing estimates locally in 
each of these components and then producing a good global 
solution. Our algorithm for log-partition function provides prov- 
able upper and lower bounds on the correct value for arbitrary 
graph G. For MAP, our algorithm provides approximation with 
quantifiable error for arbitrary G. Specifically, we show that if the 
underlying graph G either excludes some finite-sized graph as its 
minor (e.g. Planar graph) or has low doubling dimension (e.g. any 
graph with geometry), then our algorithm will produce solution 
for both questions within arbitrary accuracy. The running time 
of the algorithm is $O(n)$ ($n$ is the number of nodes in $G$), with
constant dependent on accuracy and either doubling dimension, 
or maximum vertex degree and the size of the graph that is 
excluded as a minor (e.g. 3 for all Planar graphs). 

We present a message-passing implementation of our algo- 
rithm for MAP computation using self-avoiding walk of graph. In 
order to evaluate the computational cost of this implementation, 
we derive novel tight bounds on the size of self-avoiding walk 
tree for arbitrary graph, which may be of interest in its own 
right. 

As a consequence of our algorithmic result, we show that the 
normalized log-partition function (also known as free-energy) 
for a class of regular MRFs (e.g. Ising model on 2-dimensional 
grid) will converge to a limit, that is computable to an arbitrary 
accuracy, as the size of the MRF goes to infinity. This method, like 
classical sub-additivity method, is likely to be widely applicable. 

Index Terms — Markov random fields; approximate inference; 
low doubling-dimension graphs; minor-excluded graphs; planar 
graphs; MAP-estimation; log-partition function; message-passing 
algorithms; self-avoiding walk. 



I. Introduction 

Markov Random Field (MRF) [1] based exponential family 
of distribution allows for representing distributions in an 
intuitive parametric form. Therefore, it has been successful 
in modeling many applications (see, [2] for details). The 
key operational questions of interest are related to statistical 
inference: computing most likely assignment of (partially) 
unknown variables given some observations and computation 
of probability of an assignment given the partial observations 
(equivalently, computing log-partition function). In this paper, 
we study the question of designing efficient local algorithms 
for solving these inference problems. 

A. Previous work

The question of finding MAP (or ground state) of a given 
MRF comes up in many important application areas such as 
coding theory, discrete optimization, image denoising. Simi- 
larly, log-partition function is used in counting combinatorial 
objects [3], loss-probability computation in computer net- 
works, [4], etc. Both problems are NP-hard for exact and even 
(constant) approximate computation for arbitrary graph G. 
However, the above stated applications require solving these 
problems using very simple algorithms. A popular successful 
approach for designing efficient heuristics has been as follows. 
First, identify a wide class of graphs that have simple algo- 
rithms for computing MAP and log-partition function. Then, 
for any given graph, approximately compute solution either 
by using that simple algorithm as a heuristic or in a more 
sophisticated case, by possibly solving multiple sub-problems 
induced by sub-graphs with good graph structures and then 
combining the results from these sub-problems to obtain a 
global solution. 

Such an approach has resulted in many interesting recent
results, starting with the Belief Propagation (BP) algorithm designed
for tree graphs [1]. Since there is a vast literature on this topic,
we will recall only a few results. In our opinion, two important
algorithms proposed along these lines of thought are the gen- 
eralized belief propagation (BP) [5] and the tree-reweighted 
algorithm (TRW) [6]-[8]. Key properties of interest for these 
iterative procedures are the correctness of their fixed points 
and convergence. Many results characterizing properties of 
the fixed points are known starting from [5]. Various sufficient
conditions for their convergence are known starting from [9].
However, simultaneous convergence and correctness of
such algorithms are established for only specific problems,
e.g. [10]-[12].

Finally, we discuss two relevant results. The first result 
is about properties of TRW. The TRW algorithm provides 
provable upper bound on log-partition function for arbitrary 
graph [8]. However, to the best of authors' knowledge the error 
is not quantified. The TRW for MAP estimation has a strong 
connection to specific Linear Programming (LP) relaxation of 
the problem [7]. This was made precise in a sequence of work 
by Kolmogorov [13], Kolmogorov and Wainwright [12] for 
binary MRF. It is worth noting that LP relaxation can be poor 
even for simple problems. 

The second is an approximation algorithm proposed by 
Globerson and Jaakkola [14] to compute log-partition function 
using Planar graph decomposition (PDC). PDC uses tech- 
niques of [8] in conjunction with known result about exact 
computation of partition function for binary MRF when G 
is Planar and the exponential family has a specific form 
(binary pairwise and multiplicative potentials). Their algorithm 
provides provable upper bound for arbitrary graph. However,
they do not quantify the error incurred. Further, their algorithm
is limited to binary MRF.

B. Contributions 

We propose a novel local algorithm for approximate computation
of MAP and log-partition function. For any $\epsilon > 0$,
our algorithm can produce an $\epsilon$-approximate solution for MAP
and log-partition function for arbitrary MRF $G$ as long as $G$
has either of these two properties: (a) $G$ has low doubling
dimension (see Theorems 2 and 5), or (b) $G$ excludes a
finite-sized graph as a minor (see Theorems 3 and 6). For
example, a planar graph excludes $K_5$ as a minor and thus
our algorithm provides approximation algorithms for planar
graphs.

The running time of the algorithm is $\Theta(n)$, with constant
dependent on $\epsilon$ and (a) doubling dimension for doubling
dimension graph, or (b) maximum vertex degree and size
of the graph that is excluded as minor for minor-excluded
graphs. For example, for 2-dimensional grid graph, which has
doubling dimension $O(1)$, the algorithm takes $C(\epsilon) n$ time,
where $\log\log C(\epsilon) = O(1/\epsilon)$. On the other hand, for a planar
graph with maximum vertex degree a constant, i.e. $O(1)$, the
algorithm takes $C'(\epsilon) n$ time, with $\log\log C'(\epsilon) = O(1/\epsilon)$.

In general, our algorithm works for any G and we can 
quantify bound on the error incurred by our algorithm. It is 
worth noting that our algorithm provides a provable lower 
bound on log-partition function as well unlike many of the 
previous results. 

Our algorithm is primarily based on the following idea:
First, decompose $G$ into small-size connected components, say
$G_1, \ldots, G_k$, by removing few edges of $G$. Second, compute
estimates (either MAP or log-partition) in each of the $G_i$
separately. Third, combine these estimates to produce a global
estimate while taking care of the error induced by the removed 
edges. We show that the error in the estimate depends only 
on the edges removed. This error bound characterization is 
applicable for arbitrary graph. 
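
The three steps above can be sketched on a toy instance. The following is a minimal illustration, not the algorithm analyzed later in the paper: it uses brute-force enumeration inside each component where the paper uses dynamic programming, and all names (`components`, `score`, `local_map`) and the example potentials are hypothetical. It assumes non-negative potentials, in which case the combined assignment loses at most the sum of the largest values of the removed edge potentials.

```python
from itertools import product

def components(nodes, edges):
    """Connected components of the graph (nodes, edges)."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps

def score(x, phi, psi, edges):
    """Sum of node potentials over the keys of x plus edge potentials over `edges`."""
    return (sum(phi[v][x[v]] for v in x)
            + sum(psi[e][x[e[0]]][x[e[1]]] for e in edges))

def local_map(nodes, edges, phi, psi, removed, labels=(0, 1)):
    """Step 1: drop `removed` edges (same orientation as in `edges`).
    Step 2: exact MAP per component by enumeration (stand-in for dynamic programming).
    Step 3: concatenate the per-component assignments."""
    kept = [e for e in edges if e not in removed]
    x_hat = {}
    for comp in components(nodes, kept):
        inside = [e for e in kept if e[0] in comp and e[1] in comp]
        best = max((dict(zip(comp, a)) for a in product(labels, repeat=len(comp))),
                   key=lambda x: score(x, phi, psi, inside))
        x_hat.update(best)
    return x_hat

# Toy example: a 4-cycle with non-negative (hypothetical) potentials.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
phi = {v: [0.0, 0.5 * v] for v in nodes}
psi = {e: [[1.0, 0.0], [0.0, 1.0]] for e in edges}
removed = {(1, 2), (3, 0)}
x_hat = local_map(nodes, edges, phi, psi, removed)
x_star = max((dict(zip(nodes, a)) for a in product((0, 1), repeat=4)),
             key=lambda x: score(x, phi, psi, edges))
# Largest possible contribution of the removed edges.
gap = sum(max(max(row) for row in psi[e]) for e in removed)
```

In this sketch the value of `x_hat` under the full objective lies between the optimum minus `gap` and the optimum, which mirrors the statement that the error depends only on the removed edges.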

For obtaining sharp error bounds, we need good graph 
decomposition schemes. Specifically, we use a new, simple and 
very intuitive randomized decomposition scheme for graphs 
with low doubling dimensions. For minor-excluded graphs, 
we use a simple scheme based on work by Klein, Plotkin 
and Rao [15] and Rao [16] that they had introduced to study 
the gap between max-flow and min-cut for multicommodity 
flows. In general, as long as G allows for such good edge-set 
for decomposing G into small components, our algorithm will 
provide a good estimate. 

To compute estimates in individual components, we use 
dynamic programming. Since each component is small, it is 
not computationally burdensome. However, one may obtain 
further simpler heuristics by replacing dynamic programming 
by other method such as BP or TRW for computation in the 
components. 

In order to implement dynamic programming using a message-passing
approach, we use a construction based on the self-avoiding
walk tree. Self-avoiding walk trees have been of interest in



statistical physics for various reasons (see book by Madras 
and Slade [17]). Recently, Weitz [18] obtained a surprising 
result that connected computation of marginal probability of 
a node in any binary MRF to that of marginal probability of 
a root node in an appropriate self-avoiding walk tree. We use 
a direct adaptation of this result for computing MAP estimate
to design a message passing scheme for MAP computation. In
order to evaluate computation cost, we needed a tight bound on
the size of the self-avoiding walk tree of arbitrary graph G. We
obtain a novel characterization of the size of the self-avoiding walk
tree within a factor 8 for arbitrary graph G. This result should
be of interest in its own right. 

Finally, as a (somewhat unexpected) consequence of these 
algorithmic results, we obtain a method to establish existence 
of asymptotic limits of free energy for a class of MRF. 
Specifically, we show that if the MRF is d-dimensional grid 
and all node, edge potential functions are identical then the 
free-energy (i.e. normalized log-partition function) converges 
to a limit as the size of the grid grows to infinity. In general, 
such approach is likely to extend for any regular enough MRF 
for proving existence of such limit: for example, the result 
will immediately extend to the case when the requirement of 
node, edge potential being exactly the same is replaced by the 
requirement of they being chosen from a common distribution 
in an i.i.d. fashion. 

C. Outline 

The paper is organized as follows. Section II presents
necessary background on graphs, Markov random fields, ex- 
ponential family of distribution, MAP estimation and log- 
partition function computation. 

Section III presents graph decomposition schemes. These
decomposition schemes are used later by approximation
algorithms. We present simple, intuitive and $O(n)$ running
time decomposition schemes for graphs with low doubling
dimension and graphs that exclude a finite-size graph as a minor.
Both of these schemes are randomized. The first scheme is
our original contribution. The second scheme was proposed
by Klein, Plotkin and Rao [15] and Rao [16].

Section IV presents the approximation algorithm for computing
log-partition function. We describe how it provides
upper and lower bound on log-partition function for arbitrary 
graph. Then we specialize the result for two graphs of interest: 
low doubling dimension and minor excluded graphs. 

Section V presents the approximation algorithm for MAP
estimation. We describe how it provides approximate estimate
for arbitrary graph with quantifiable approximation error.
Then we specialize the result for two graphs of interest: low 
doubling dimension and minor excluded graphs. 

Section VI describes message passing implementation of
the MAP estimation algorithm for binary pair-wise MRF for 
arbitrary G. This can be used by our approximation algorithm 
to obtain message passing implementation. This algorithm 
builds upon work by Weitz [18]. We describe a novel tight 
bound on the size of self-avoiding walk tree for any G. This 
helps in evaluating the computation time. The message passing 
implementation has similar computation complexity as the 
centralized algorithm. 

Section VII presents an experimental evaluation of our
algorithm for a popular synthetic model on a grid graph. We
compare our algorithm with TRW and PDC algorithms and
show that our algorithm is very competitive. An important
feature of our algorithm is scalability.

Section VIII presents the implication of our algorithmic
result in establishing asymptotic limit of free energy for regular 
MRFs, such as an Ising model on d-dimensional grid. 

D. How to read this paper: a suggestion 

A reader interested in obtaining a quick understanding of
the results should skip everything in Section III other than
the definition of $(\epsilon, \Delta)$ decomposition, and skip Section
VI completely. Reading these two sections at the very end may
help all readers parse the results with ease.

II. Preliminaries 

This section provides the background necessary for subse- 
quent sections. We begin with an overview of some graph 
theoretic basics. We then describe formalism of Markov ran- 
dom field and exponential family of distribution. We formulate 
the problem of log-partition function computation and MAP 
estimation for Markov random field. We conclude by stating 
precise definitions of approximate MAP estimation and ap- 
proximate log-partition function computation. 

A. Graphs 

An undirected graph $G = (V, E)$ consists of a set of vertices
$V = \{1, \ldots, n\}$ that are connected by a set of edges $E \subset V \times V$.
We consider only simple graphs, that is, multiple edges
between a pair of nodes or self-loops are not allowed. Let
$\Gamma(v) = \{u \in V : (u,v) \in E\}$ denote the set of all neighboring
nodes of $v \in V$. The size of the set $\Gamma(v)$ is the degree of node
$v$, denoted as $d_v$. Let $d^* = \max_{v \in V} d_v$ be the maximum vertex
degree in $G$. A clique of the graph $G$ is a fully-connected
subset $C$ of the vertex set (i.e. $(u, v) \in E$ for all $u, v \in C$).
Nodes $u$ and $v$ are called connected if there exists a path in
$G$ starting from $u$ and ending at $v$ or vice versa since $G$ is
undirected. Each graph $G$ naturally decomposes into disjoint
sets of vertices $V_1, \ldots, V_k$ where for $1 \le i \le k$, any two
nodes, say $u, v \in V_i$, are connected. The sets $V_1, \ldots, V_k$ are
called the connected components of $G$.

We introduce a popular notion of dimension for graph $G$
(see recent works [19], [20], [21] for relevant details). Define
$d_G : V \times V \to \mathbb{R}_+$ as

$$d_G(i,j) = \text{length of the shortest path between } i \text{ and } j.$$

If $i = j$ then $d_G(i,j) = 0$ and if $i, j$ are not connected, then
define $d_G(i,j) = \infty$. It is easy to check that thus defined
$d_G$ is a metric on vertex set $V$. Define ball of radius $r \in \mathbb{R}_+$
around $v \in V$ as $\mathbf{B}(v, r) = \{u \in V : d_G(u, v) < r\}$. Define

$$\rho(v, r) = \inf\Big\{K \in \mathbb{N} : \exists\, u_1, \ldots, u_K \in V,\ \mathbf{B}(v, r) \subset \cup_{i=1}^{K} \mathbf{B}(u_i, r/2)\Big\}.$$

Then, $\rho(G) = \sup_{v \in V,\, r \in \mathbb{R}_+} \log_2 \rho(v, r)$ is called the doubling
dimension of graph $G$. Intuitively, this definition captures the
notion of dimension $d$ in the Euclidean space $\mathbb{R}^d$. It follows
from definition that for any graph $G$, $\rho(G) = O(\log_2 n)$.
We note the following property whose proof is presented in
Appendix A.

Lemma 1: For any $v \in V$ and $r \in \mathbb{N}$, $|\mathbf{B}(v, 2^r)| \le 2^{r \rho(G)}$.
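
As a small illustration of the ball $\mathbf{B}(v, r)$ used in the definition above, the sketch below computes balls by breadth-first search with the strict inequality $d_G(u, v) < r$. The helper names `ball` and `grid` are hypothetical and the grid graph is only an example; the code does not compute $\rho(G)$ itself.

```python
from collections import deque

def ball(adj, v, r):
    """Set of vertices u with d_G(u, v) < r, found by breadth-first search."""
    if r <= 0:
        return set()
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            # Only keep vertices whose distance stays strictly below r.
            if w not in dist and dist[u] + 1 < r:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def grid(m):
    """Adjacency lists of the m x m grid graph (an example of a graph with geometry)."""
    adj = {(i, j): [] for i in range(m) for j in range(m)}
    for (i, j) in adj:
        for (a, b) in ((i + 1, j), (i, j + 1)):
            if (a, b) in adj:
                adj[(i, j)].append((a, b))
                adj[(a, b)].append((i, j))
    return adj
```

On a grid, doubling the radius multiplies the ball size by a bounded factor, which is the behaviour the doubling dimension is meant to capture.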
Next, we introduce a class of graphs known as minor-excluded
graphs (see a series of publications by Robertson
and Seymour under "the graph minor theory" project [22]). A
graph H is called minor of G if we can transform G into H 
through an arbitrary sequence of the following two operations; 
(a) removal of an edge; (b) merge two connected vertices u, v: 
that is, remove edge {u, v) as well as vertices u and v; add 
a new vertex and make all edges incident on this new vertex 
that were incident on u or v. Now, if H is not a minor of G 
then we say that G excludes H as a minor. 

The explanation of the following statement may help understand
the definition better: any graph $H$ with $r$ nodes is a
minor of $K_r$, where $K_r$ is a complete graph of $r$ nodes. This
is true because one may obtain $H$ by removing edges from
$K_r$ that are absent in $H$. More generally, if $G$ is a subgraph
of $G'$ and $G$ has $H$ as a minor, then $G'$ has $H$ as its minor.
Let $K_{r,r}$ denote a complete bipartite graph with $r$ nodes in
each partition. Then $K_r$ is a minor of $K_{r,r}$. An important
implication of this is as follows: to prove property P for graph
$G$ that excludes $H$, of size $r$, as a minor, it is sufficient to prove
that any graph that excludes $K_{r,r}$ as a minor has property P.
This fact was cleverly used by Klein et al. [15].

B. Markov random field 

A Markov Random Field (MRF) is defined on the basis of
an undirected graph $G = (V, E)$ in the following manner. Let
$V = \{1, \ldots, n\}$ and $E \subset V \times V$. For each $v \in V$, let $X_v$
be a random variable taking values in some finite valued space
$\Sigma_v$. Without loss of generality, let us assume that $\Sigma_v = \Sigma$ for
all $v \in V$. Let $X = (X_1, \ldots, X_n)$ be the collection of these
random variables taking values in $\Sigma^n$. For any subset $A \subset V$,
we let $X_A$ denote $\{X_v \mid v \in A\}$. We call a subset $S \subset V$ a
cut of $G$ if by its removal from $G$ the graph decomposes into
two or more disconnected components. That is, $V \setminus S = A \cup B$
with $A \cap B = \emptyset$ and for any $a \in A$, $b \in B$, $(a, b) \notin E$.
The $X$ is called a Markov random field if, for any cut $S \subset V$,
$X_A$ and $X_B$ are conditionally independent given $X_S$, where
$V \setminus S = A \cup B$.

By the Hammersley-Clifford theorem, any Markov random
field that is strictly positive (i.e. $\Pr(X = x) > 0$ for all
$x \in \Sigma^n$) can be defined in terms of a decomposition of the
distribution over cliques of the graph. Specifically, we will
restrict our attention to pair-wise Markov random fields (to be
defined precisely soon) only in this paper. This does not incur
loss of generality for the following reason. A distributional
representation that decomposes in terms of distribution over
cliques can be represented through a factor graph over discrete
variables. Any factor graph over discrete variables can be
transformed into a pair-wise Markov random field (see, [7]
for example) by introducing auxiliary variables. As the reader
shall notice, the techniques of this paper can be extended
to Markov random fields with higher-order interaction that
contain hyper-edges.

Now, we present the precise definition of pair-wise Markov
random field. We will consider distributions in exponential
form. For each vertex $v \in V$ and edge $(u,v) \in E$, the
corresponding potential functions are $\phi_v : \Sigma \to \mathbb{R}_+$ and
$\psi_{uv} : \Sigma^2 \to \mathbb{R}_+$. Then, the distribution of $X$ is given as
follows: for $x \in \Sigma^n$,

$$\Pr[X = x] \propto \exp\Big( \sum_{v \in V} \phi_v(x_v) + \sum_{(u,v) \in E} \psi_{uv}(x_u, x_v) \Big). \tag{1}$$

We note that the assumption of $\phi_v$, $\psi_{uv}$ being non-negative
does not incur loss of generality for the following reasons:
(a) the distribution remains the same if we consider potential
functions $\phi_v + C$, $\psi_{uv} + C$, for all $v \in V$, $(u, v) \in E$
with constant $C$; and (b) by selecting large enough constant,
the modified functions will become non-negative as they are
defined over finite discrete domain.

C. Log-partition function

The normalization constant in definition (1) of distribution
is called the partition function, $Z$. Specifically,

$$Z = \sum_{x \in \Sigma^n} \exp\Big( \sum_{v \in V} \phi_v(x_v) + \sum_{(u,v) \in E} \psi_{uv}(x_u, x_v) \Big).$$

Clearly, the knowledge of $Z$ is necessary in order to evaluate
probability distribution or to compute marginal probabilities,
i.e. $\Pr(X_v = x_v)$ for $v \in V$. In applications in computer
science, $Z$ corresponds to the number of combinatorial objects,
in statistical physics normalized logarithm of $Z$ provides free-energy
and in reversible stochastic networks $Z$ provides loss
probability for evaluating quality of service.

In this paper, we will be interested in obtaining estimate of
$\log Z$. Specifically, we will call $\hat{Z}$ an $\epsilon$-approximation of
$Z$ if

$$(1 - \epsilon) \log Z \le \log \hat{Z} \le (1 + \epsilon) \log Z.$$

D. MAP assignment

The maximum a posteriori (MAP) assignment $x^*$ is one
with maximal probability, i.e.

$$x^* \in \arg\max_{x \in \Sigma^n} \Pr[X = x].$$

Computing MAP assignment is of interest in wide variety
of applications. In combinatorial optimization problem, $x^*$
corresponds to an optimizing solution, in the context of image
processing it can be used as the basis for image segmentation
techniques and in error-correcting codes it corresponds to
decoding the received noisy code-word.

In our setup, MAP assignment $x^*$ corresponds to

$$x^* \in \arg\max_{x \in \Sigma^n} \Big( \sum_{v \in V} \phi_v(x_v) + \sum_{(u,v) \in E} \psi_{uv}(x_u, x_v) \Big).$$

Define $\mathcal{H}(x) = \sum_{v \in V} \phi_v(x_v) + \sum_{(u,v) \in E} \psi_{uv}(x_u, x_v)$. We
will be interested in obtaining an $\epsilon$-estimate, say $\hat{x}$, of $x^*$ such
that

$$(1 - \epsilon) \mathcal{H}(x^*) \le \mathcal{H}(\hat{x}) \le \mathcal{H}(x^*).$$

III. Graph decomposition

In this section, we introduce notion of graph decomposition.
We describe very simple algorithms for obtaining decomposition
for graphs with low doubling dimension and minor-excluded
graphs. In the later sections, we will show that such
decomposable graphs are good structures in the sense that they
allow for local algorithms for approximately computing
log-partition function and MAP.
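
For a concrete reading of the Section II definitions of the partition function, the MAP objective and the two $\epsilon$-approximation criteria, the following sketch evaluates them exactly by enumeration on a tiny MRF. This is only feasible for very small $n$ and is not the paper's algorithm; the function names and the dictionary-based encoding of the potentials are assumptions made for illustration, and natural logarithms are used.

```python
import math
from itertools import product

def H(x, phi, psi):
    """MAP objective: sum of node potentials plus sum of edge potentials."""
    return (sum(phi[v][x[v]] for v in phi)
            + sum(psi[(u, v)][x[u]][x[v]] for (u, v) in psi))

def log_partition_and_map(phi, psi, labels=(0, 1)):
    """Exact log Z and one MAP assignment by enumerating all label vectors."""
    nodes = sorted(phi)
    best_x, best_h, terms = None, -math.inf, []
    for a in product(labels, repeat=len(nodes)):
        x = dict(zip(nodes, a))
        h = H(x, phi, psi)
        terms.append(h)
        if h > best_h:
            best_x, best_h = x, h
    m = max(terms)  # log-sum-exp shift for numerical stability
    log_z = m + math.log(sum(math.exp(t - m) for t in terms))
    return log_z, best_x

def is_eps_approx_log_z(log_z_hat, log_z, eps):
    """Checks (1 - eps) log Z <= log Z_hat <= (1 + eps) log Z."""
    return (1 - eps) * log_z <= log_z_hat <= (1 + eps) * log_z

def is_eps_approx_map(x_hat, x_star, phi, psi, eps):
    """Checks (1 - eps) H(x*) <= H(x_hat) <= H(x*)."""
    h_star = H(x_star, phi, psi)
    return (1 - eps) * h_star <= H(x_hat, phi, psi) <= h_star
```

Both criteria are multiplicative, so they are meaningful because the potentials are taken to be non-negative.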



A. $(\epsilon, \Delta)$ decomposition

Given $\epsilon, \Delta > 0$, we define notion of $(\epsilon, \Delta)$ decomposition
for a graph $G = (V, E)$. This notion can be stated in terms of
vertex-based decomposition or edge-based decomposition.

We call a random subset of vertices $\mathcal{B} \subset V$ an $(\epsilon, \Delta)$ vertex-decomposition
of $G$ if the following holds: (a) For any $v \in V$,
$\Pr(v \in \mathcal{B}) \le \epsilon$. (b) Let $S_1, \ldots, S_K$ be connected components
of graph $G' = (V', E')$ where $V' = V \setminus \mathcal{B}$ and $E' = \{(u, v) \in
E : u, v \in V'\}$. Then, $\max_{1 \le k \le K} |S_k| \le \Delta$ with probability
1.

Similarly, a random subset of edges $\mathcal{B} \subset E$ is called an
$(\epsilon, \Delta)$ edge-decomposition of $G$ if the following holds: (a) For
any $e \in E$, $\Pr(e \in \mathcal{B}) \le \epsilon$. (b) Let $S_1, \ldots, S_K$ be connected
components of graph $G' = (V', E')$ where $V' = V$ and $E' =
E \setminus \mathcal{B}$. Then, $\max_{1 \le k \le K} |S_k| \le \Delta$ with probability 1.
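
The two conditions of a vertex-decomposition can be checked mechanically. The sketch below does so for a deliberately simple example that is not from the paper: on a path with $n$ vertices, removing every $m$-th vertex with a uniformly random offset gives a $(1/m, m-1)$ vertex-decomposition. All function names are hypothetical.

```python
import random

def max_component_size(adj, removed):
    """Condition (b): largest connected component after deleting `removed` vertices."""
    seen, largest = set(removed), 0
    for s in adj:
        if s in seen:
            continue
        size, stack = 0, [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        largest = max(largest, size)
    return largest

def removal_frequency(sampler, vertices, trials=2000):
    """Condition (a): Monte Carlo estimate of Pr(v in B) for every vertex v."""
    counts = {v: 0 for v in vertices}
    for _ in range(trials):
        for v in sampler():
            counts[v] += 1
    return {v: c / trials for v, c in counts.items()}

def path_graph(n):
    """Adjacency lists of the path 0 - 1 - ... - (n-1)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def periodic_cut(n, m, rng):
    """Remove the vertices congruent to a uniformly random offset modulo m."""
    offset = rng.randrange(m)
    return {v for v in range(n) if v % m == offset}
```

Here each vertex is removed with probability exactly $1/m$, and every surviving component is a run of at most $m-1$ consecutive vertices.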

B. Low doubling-dimension graphs 

This section presents an $(\epsilon, \Delta)$ decomposition algorithm for
graphs with low doubling dimension for various choices of
$\epsilon$ and $\Delta$. Such a decomposition algorithm can be obtained
through a probabilistically padded decomposition for such
graphs [20]. However, we present our (different) algorithm
due to its simplicity. It is worth noting that this simplicity of
the algorithm requires a proof technique different from (and more
complicated than) that known in the literature.

We will describe the algorithm for node-based $(\epsilon, \Delta)$ decomposition.
This will immediately imply an algorithm for edge-based
decomposition for the following reason: given $G = (V, E)$
with doubling dimension $\rho(G)$, consider the graph of its edges
$\mathcal{G} = (E, \mathcal{E})$ where $(e, e') \in \mathcal{E}$ if $e, e'$ share a vertex in $G$. It
is easy to check that $\rho(\mathcal{G}) \le 2\rho(G) + 1$. Therefore, running
the algorithm for node-based decomposition on $\mathcal{G}$ will provide an
edge-based decomposition.

The node-based decomposition algorithm for $G$ will be
described for the metric space on $V$ with respect to the shortest
path metric $d_G$ introduced earlier. Clearly, it is not possible
to have $(\epsilon, \Delta)$ decomposition for any $\epsilon$ and $\Delta$ values. As will
become clear later, it is important to have such decomposition
for $\epsilon$ and $\Delta$ being not too large (specifically, we would like
$\Delta = O(\log n)$). Therefore, we describe the algorithm for any
$\epsilon > 0$ and an operational parameter $K$. We will show that
the algorithm will produce $(\epsilon, \Delta)$ node-decomposition where
$\Delta$ will depend on $\epsilon$, $K$ and $\rho$.

Given $\epsilon$ and $K$, define random variable $Q$ over $\{1, \ldots, K\}$
as

$$\Pr[Q = i] = \begin{cases} \epsilon (1 - \epsilon)^{i-1} & \text{if } 1 \le i < K, \\ (1 - \epsilon)^{K-1} & \text{if } i = K. \end{cases}$$

Define $P_K = (1 - \epsilon)^{K-1}$. The algorithm DB-DIM$(\epsilon, K)$
described next essentially does the following: initially, all
vertices are colored white. Iteratively, choose a white vertex
arbitrarily. Let $u_t$ be the vertex chosen in iteration $t$. Draw an
independent random number as per distribution of $Q$, say $Q_t$.
Select all white vertices that are at distance $Q_t$ from $u_t$ into $\mathcal{B}$
and color them blue; color all white vertices at distance $< Q_t$
from $u_t$ (including itself) as red. Repeat this process till no
more white vertices are left. Output $\mathcal{B}$ (i.e. blue nodes) as the
decomposition. Now, the precise description of the algorithm.

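DB-DIM$(\epsilon, K)$

(1) Initially, set iteration number $t = 0$, $\mathcal{W}_0 = V$, $\mathcal{B}_0 = \emptyset$
and $\mathcal{R}_0 = \emptyset$.

(2) Repeat the following until $\mathcal{W}_t = \emptyset$:

(a) Choose an element $u_t \in \mathcal{W}_t$ uniformly at random.

(b) Draw a random number $Q_t$ independently according
to the distribution of $Q$.

(c) Update

(i) $\mathcal{B}_{t+1} \leftarrow \mathcal{B}_t \cup \{w \mid d_G(u_t, w) = Q_t \text{ and } w \in \mathcal{W}_t\}$,

(ii) $\mathcal{R}_{t+1} \leftarrow \mathcal{R}_t \cup \{w \mid d_G(u_t, w) < Q_t \text{ and } w \in \mathcal{W}_t\}$,

(iii) $\mathcal{W}_{t+1} \leftarrow \mathcal{W}_t \cap (\mathcal{B}_{t+1} \cup \mathcal{R}_{t+1})^c$.

(d) Increment $t \leftarrow t + 1$.

(3) Output $\mathcal{B}_t$.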
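
The steps of DB-DIM translate fairly directly into code. The following is a minimal sketch under a few assumptions: the graph is given as adjacency lists, distances are always measured in the original graph (as in the description above), and the names `sample_Q`, `bfs_distances`, `db_dim` and `component_sizes` are hypothetical. Sorting the white set in every iteration keeps the sketch reproducible for a given seed but is not how one would obtain the running time stated for the algorithm.

```python
import random
from collections import deque

def sample_Q(eps, K, rng):
    """Pr[Q=i] = eps(1-eps)^(i-1) for 1 <= i < K and (1-eps)^(K-1) for i = K."""
    for i in range(1, K):
        if rng.random() < eps:
            return i
    return K

def bfs_distances(adj, source, max_dist):
    """Shortest-path distances from `source` in the original graph, up to max_dist."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not explore beyond the required radius
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def db_dim(adj, eps, K, seed=None):
    """Returns the set of blue vertices produced by the coloring procedure."""
    rng = random.Random(seed)
    white, blue, red = set(adj), set(), set()
    while white:
        u = rng.choice(sorted(white))      # step (a): uniform white vertex
        q = sample_Q(eps, K, rng)          # step (b): independent draw of Q_t
        for w, d in bfs_distances(adj, u, q).items():
            if w in white:                 # step (c): only white vertices change color
                if d == q:
                    blue.add(w)
                else:
                    red.add(w)
        white -= blue | red
    return blue

def component_sizes(adj, removed):
    """Sizes of the connected components left after deleting `removed` vertices."""
    seen, sizes = set(removed), []
    for s in adj:
        if s in seen:
            continue
        size, stack = 0, [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(size)
    return sizes
```

For example, `db_dim(adj, 0.2, 4, seed=1)` on a cycle returns a blue set whose removal leaves only short arcs, since each surviving component consists of vertices colored red in a single iteration.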

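We state the property of the algorithm DB-DIM$(\epsilon, K)$ as follows.

Lemma 2: Given $G$ with doubling dimension $\rho = \rho(G)$
and $\epsilon \in (0,1)$, let $K(\epsilon, \rho) = \frac{12\rho}{\epsilon} \log\big(\frac{24\rho}{\epsilon}\big)$. Then
DB-DIM$(\epsilon, K(\epsilon, \rho))$ produces random output $\mathcal{B} \subset V$ that is a
$(2\epsilon, \Delta(\epsilon, \rho))$ node-decomposition of $G$ with $\Delta(\epsilon, \rho) \le
K(\epsilon, \rho)^{2\rho}$. The algorithm takes $O(C(\epsilon, \rho) n)$ amount of time
to produce $\mathcal{B}$, where $C(\epsilon, \rho) = K(\epsilon, \rho)^{2\rho}$.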

Before presenting the proof of Lemma 2, we state the
following important corollary for designing efficient algorithm.

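Corollary 1: Let $\epsilon < 1$ and $\rho$ be such that $\rho \log(\rho/\epsilon) =
o(\log\log n)$. Then DB-DIM$(\epsilon/2, K(\epsilon/2, \rho))$ produces an
$(\epsilon, \log^{1/L} n)$ node-decomposition for any finite (not scaling
with $n$) $L$.

Proof: Since $\rho \log(\rho/\epsilon) = o(\log\log n)$, we have that

$$2\rho \Big( \log\frac{24\rho}{\epsilon} + \log\log\frac{48\rho}{\epsilon} \Big) = o(\log\log n).$$

Therefore, by definition of $K(\epsilon, \rho)$ we have that

$$\Delta(\epsilon/2, \rho) \le K(\epsilon/2, \rho)^{2\rho} = \exp\Big( 2\rho \Big( \log\frac{24\rho}{\epsilon} + \log\log\frac{48\rho}{\epsilon} \Big) \Big) = \exp\big(o(\log\log n)\big) \le \log^{1/L} n, \tag{2}$$

for any finite $L$. The last inequality follows from the definition
of notation $o(\cdot)$. Now, Lemma 2 implies the desired claim. ■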
Proof: (Lemma 2) To prove the claim of the Lemma, we need
to show two properties of the output set $\mathcal{B}$ for given $\epsilon$ and
$K = K(\epsilon, \rho)$: (a) for any $v \in V$, $\Pr(v \in \mathcal{B}) \le 2\epsilon$; (b)
the graph $G$, upon removal of $\mathcal{B}$, decomposes into connected
components each of size at most $K^{2\rho}$.

Before we prove (a) and (b), let us bound the running time
of the algorithm. Note that the algorithm runs for at most $n$
iterations. In each iteration, the algorithm needs to check nodes
that are within distance $K(\epsilon, \rho)$ of the randomly chosen node.
Therefore, the number of operations performed in each iteration
is at most $O(K(\epsilon, \rho)^{2\rho})$. Thus, the total running time is
$O(n K(\epsilon, \rho)^{2\rho})$. Now we first justify (a) and then (b).

Proof of (a). To prove (a), we use the following Claim.

Claim 1: Consider metric space $\mathcal{M} = (V, d_G)$ with $|V| = n$.
Let $\mathcal{B} \subset V$ be the random set that is the output of the decomposition
algorithm with parameter $(\epsilon, K)$ applied to $\mathcal{M}$. Then,
for any $v \in V$

$$\Pr[v \in \mathcal{B}] \le \epsilon + P_K |\mathbf{B}(v, K)|,$$

where $\mathbf{B}(v, K)$ is the ball of radius $K$ in $\mathcal{M}$ with respect to
the $d_G$.

Proof: (Claim 1) The proof is by induction on the number
of points $n$ over which the metric space is defined. When $n = 1$,
the algorithm chooses the only point as $u_0$ in the initial iteration
and hence it cannot be part of the output set $\mathcal{B}$. That is, for
this only point, say $v$,

$$\Pr[v \in \mathcal{B}] = 0 \le \epsilon + P_K |\mathbf{B}(v, K)|.$$

Thus, we have verified the base case for induction ($n = 1$).

As induction hypothesis, suppose that Claim 1 is true
for any metric space on $n$ points with $n < N$ for some $N \ge 2$.
As the induction step, we wish to establish that for a metric
space $\mathcal{M} = (V, d_G)$ with $|V| = N$, Claim 1 is true.
For this, consider any point $v \in V$. Now consider the first
iteration of the algorithm applied to $\mathcal{M}$. The algorithm picks
$u_0 \in V$ uniformly at random in the first iteration. Given $v$,
depending on the choice of $u_0$ we consider four different cases
(or events).

Case 1. This case corresponds to event $E_1$ where the chosen
random $u_0$ is equal to the point $v$ of our interest. By definition of
the algorithm, under the event $E_1$, $v$ will never be part of output
set $\mathcal{B}$. That is,

$$\Pr[v \in \mathcal{B} \mid E_1] = 0 \le \epsilon + P_K |\mathbf{B}(v, K)|.$$

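Case 2. Now, suppose $u_0$ is such that $v \ne u_0$ and
$d_G(u_0, v) < K$. Call this event $E_2$. Further, depending on
the choice of random number $Q_0$, define events:

$$E_{21} = \{d_G(u_0, v) < Q_0\}, \quad E_{22} = \{d_G(u_0, v) = Q_0\}, \quad \text{and} \quad E_{23} = \{d_G(u_0, v) > Q_0\}.$$

By definition of the algorithm, when $E_{21}$ happens, $v$ is selected
as part of $\mathcal{R}_1$ and hence can never be part of output $\mathcal{B}$.
When $E_{22}$ happens, $v$ is selected as part of $\mathcal{B}_1$ and hence
it is definitely part of output set $\mathcal{B}$. When $E_{23}$ happens, $v$ is
neither selected in set $\mathcal{R}_1$ nor selected in set $\mathcal{B}_1$. It is left as
an element of the set $\mathcal{W}_1$. This new set $\mathcal{W}_1$ has points $< N$.
The original metric $d_G$ is still a metric on the points$^1$ of $\mathcal{W}_1$. By
definition, the algorithm only cares about $(\mathcal{W}_1, d_G)$ in future
and is not affected by its decisions in the past. Therefore, we can
invoke the induction hypothesis, which implies that if event $E_{23}$
happens then the probability of $v \in \mathcal{B}$ is bounded above by
$\epsilon + P_K |\mathbf{B}(v, K)|$. Finally, let us relate $\Pr[E_{21} \mid E_2]$ with
$\Pr[E_{22} \mid E_2]$. Suppose $d_G(u_0, v) = \ell < K$. By definition of
the probability distribution of $Q$, we have

$$\Pr[E_{22} \mid E_2] = \epsilon (1 - \epsilon)^{\ell - 1}; \tag{3}$$

$$\Pr[E_{21} \mid E_2] = (1 - \epsilon)^{K-1} + \sum_{j = \ell + 1}^{K-1} \epsilon (1 - \epsilon)^{j-1} = (1 - \epsilon)^{\ell}. \tag{4}$$

That is,

$$\Pr[E_{22} \mid E_2] = \frac{\epsilon}{1 - \epsilon} \Pr[E_{21} \mid E_2].$$

$^1$ Note the following subtle but crucial point. We are not changing the metric
$d_G$ after we remove points from the original set of points as part of the algorithm.

Let $q = \Pr[E_{21} \mid E_2]$. Then,

$$\begin{aligned}
\Pr[v \in \mathcal{B} \mid E_2] &= \Pr[v \in \mathcal{B} \mid E_{21} \cap E_2] \Pr[E_{21} \mid E_2] \\
&\quad + \Pr[v \in \mathcal{B} \mid E_{22} \cap E_2] \Pr[E_{22} \mid E_2] \\
&\quad + \Pr[v \in \mathcal{B} \mid E_{23} \cap E_2] \Pr[E_{23} \mid E_2] \\
&\le 0 \times q + 1 \times \frac{\epsilon q}{1 - \epsilon} + \big(\epsilon + P_K |\mathbf{B}(v, K)|\big) \Big(1 - \frac{q}{1 - \epsilon}\Big) \\
&= \epsilon + P_K |\mathbf{B}(v, K)| - \frac{q P_K |\mathbf{B}(v, K)|}{1 - \epsilon} \\
&\le \epsilon + P_K |\mathbf{B}(v, K)|.
\end{aligned} \tag{5}$$

Case 3. Now, suppose $u_0 \ne v$ is such that $d_G(u_0, v) = K$.
Call this event $E_3$. Further, let event $E_{31} = \{Q_0 = K\}$. Due
to independence of selection of $Q_0$, $\Pr[E_{31} \mid E_3] = P_K$. Under
event $E_{31} \cap E_3$, $v \in \mathcal{B}$ with probability 1. Therefore,

$$\begin{aligned}
\Pr[v \in \mathcal{B} \mid E_3] &= \Pr[v \in \mathcal{B} \mid E_{31} \cap E_3] \Pr[E_{31} \mid E_3] + \Pr[v \in \mathcal{B} \mid E_{31}^c \cap E_3] \Pr[E_{31}^c \mid E_3] \\
&= P_K + \Pr[v \in \mathcal{B} \mid E_{31}^c \cap E_3] (1 - P_K).
\end{aligned} \tag{6}$$

Under event $E_{31}^c \cap E_3$, we have $v \in \mathcal{W}_1$ and the remaining
metric space $(\mathcal{W}_1, d_G)$. This metric space has $< N$ points.
Further, the ball of radius $K$ around $v$ with respect to this
new metric space has at most $|\mathbf{B}(v, K)| - 1$ points (this ball
is with respect to the original metric space $\mathcal{M}$ of $N$ points).
We can invoke the induction hypothesis for this new metric space
(because of similar justification as in the previous case) to
obtain

$$\Pr[v \in \mathcal{B} \mid E_{31}^c \cap E_3] \le \epsilon + P_K \big(|\mathbf{B}(v, K)| - 1\big). \tag{7}$$

From (6) and (7), we have

$$\begin{aligned}
\Pr[v \in \mathcal{B} \mid E_3] &\le P_K + (1 - P_K)\big(\epsilon + P_K (|\mathbf{B}(v, K)| - 1)\big) \\
&= \epsilon (1 - P_K) + P_K |\mathbf{B}(v, K)| + P_K^2 \big(1 - |\mathbf{B}(v, K)|\big) \\
&\le \epsilon + P_K |\mathbf{B}(v, K)|.
\end{aligned} \tag{8}$$

Case 4. Finally, let $E_4$ be the event that $d_G(u_0, v) > K$.
Then, at the end of the first iteration of the algorithm, we
again have the remaining metric space $(\mathcal{W}_1, d_G)$ such that
$|\mathcal{W}_1| < N$. Hence, as before, by the induction hypothesis we will
have

$$\Pr[v \in \mathcal{B} \mid E_4] \le \epsilon + P_K |\mathbf{B}(v, K)|.$$

Now, the four cases are exhaustive and disjoint. That is,
$\cup_{i=1}^{4} E_i$ is the universe. Based on the above discussion, we
obtain the following.

$$\begin{aligned}
\Pr[v \in \mathcal{B}] &= \sum_{i=1}^{4} \Pr[v \in \mathcal{B} \mid E_i] \Pr[E_i] \\
&\le \Big( \max_{i=1}^{4} \Pr[v \in \mathcal{B} \mid E_i] \Big) \Big( \sum_{i=1}^{4} \Pr[E_i] \Big) \\
&\le \epsilon + P_K |\mathbf{B}(v, K)|.
\end{aligned} \tag{9}$$

This completes the proof of Claim 1. ■

Now, we will use Claim 1 to complete the proof of (a). Lemma
1 for a metric space with doubling dimension $\rho$ and integer
distances implies that

$$|\mathbf{B}(v, K)| \le |\mathbf{B}(v, 2^{\lceil \log_2 K \rceil})| \le 2^{\rho (\log_2 K + 1)} = (2K)^{\rho}.$$

Therefore, it is sufficient to show that

$$P_K (2K)^{\rho} \le \epsilon.$$

Recall that $K(\epsilon, \rho) = \frac{12\rho}{\epsilon} \log\big(\frac{24\rho}{\epsilon}\big)$, and $P_K = (1 - \epsilon)^{K-1}$.
Hence,

$$K = \frac{12\rho}{\epsilon} \log\Big(\frac{24\rho}{\epsilon}\Big) \ge \ldots \tag{10}$$

Now since $K \ge 3$, we obtain that $K - 1 \ge \ldots \log 2K$. Then,
from $K \ge \ldots$ and $\rho \ge 1$,

$$K - 1 \ge \frac{2\rho}{\epsilon} \log 2K + \frac{2}{\epsilon} \log\frac{1}{\epsilon}.$$

Note that $\log (1 - \epsilon)^{-1} \ge \log(1 + \epsilon) \ge \frac{\epsilon}{2}$, for $\epsilon \in (0, 1)$.
Hence,

$$(K - 1) \log (1 - \epsilon)^{-1} \ge \rho \log 2K + \log\frac{1}{\epsilon},$$

which implies

$$(1 - \epsilon)^{K-1} (2K)^{\rho} \le \epsilon.$$

This completes the proof of (a) of Lemma 2.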
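
The final inequality of part (a), $(1 - \epsilon)^{K-1} (2K)^{\rho} \le \epsilon$ with $K = K(\epsilon, \rho)$, can be checked numerically. The sketch below works in the log domain to avoid underflow. It assumes natural logarithms and rounds $K$ up to an integer, neither of which is stated explicitly in the text; the function names are hypothetical.

```python
import math

def K_of(eps, rho):
    """K(eps, rho) = (12 rho / eps) log(24 rho / eps), rounded up to an integer."""
    return math.ceil((12 * rho / eps) * math.log(24 * rho / eps))

def log_lhs(eps, rho):
    """log of (1 - eps)^(K - 1) * (2K)^rho, to be compared with log(eps)."""
    K = K_of(eps, rho)
    return (K - 1) * math.log(1 - eps) + rho * math.log(2 * K)

def bound_holds(eps, rho):
    """True when (1 - eps)^(K - 1) (2K)^rho <= eps for K = K(eps, rho)."""
    return log_lhs(eps, rho) <= math.log(eps)
```

For instance, `bound_holds(0.1, 2)` evaluates the inequality for $\epsilon = 0.1$ and $\rho = 2$. In such checks the left-hand side is typically far below $\epsilon$, which suggests the constants in $K(\epsilon, \rho)$ are chosen for a clean proof rather than tightness.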

Proof of (b). First we give some notations. Define $R_t = \mathcal{R}_t - \mathcal{R}_{t-1}$, $B_t = \mathcal{B}_t - \mathcal{B}_{t-1}$ and

\[
\partial R_t = \{ v \in V : v \notin R_t \text{ and } \exists\, v' \in R_t \text{ s.t. } d_G(v,v') = 1 \}.
\]

The following are straightforward observations implied by the algorithm: for any $t > 0$, (i) $R_t \cap \mathcal{R}_{t-1} = \emptyset$, (ii) $B_t \cap \mathcal{B}_{t-1} = \emptyset$, (iii) $R_t \subset B(u_{t-1}, Q_{t-1})$, and (iv) $B_t \subset B(u_{t-1}, Q_{t-1}+1) - B(u_{t-1}, Q_{t-1})$. Now, we state a crucial claim for proving (b).

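Claim 2: For all $t \ge 0$, $\partial R_t \subset \mathcal{B}_t$.

Proof: (Claim 2) We prove it by induction. Initially, $\partial R_0 = \mathcal{B}_0 = \emptyset$ and hence the claim is trivial. At the end of the first iteration, by definition of the algorithm

\[
R_1 = \mathcal{R}_1 = B(u_0, Q_0), \quad\text{and}\quad B_1 = \mathcal{B}_1 = B(u_0, Q_0 + 1) - B(u_0, Q_0).
\]

Therefore, by definition $\partial R_1 = B_1$. Thus, the base case of induction is verified. Now, as the hypothesis for induction suppose that $\partial R_t \subset \mathcal{B}_t$ for all $t \le \ell$, for some $\ell \ge 1$. As induction step, we will establish that $\partial R_{\ell+1} \subset \mathcal{B}_{\ell+1}$.

Suppose to the contrary that $\partial R_{\ell+1} \not\subset \mathcal{B}_{\ell+1}$. That is, there exists $v \in \partial R_{\ell+1}$ such that $v \notin \mathcal{B}_{\ell+1}$. By definition of the algorithm, we have

\[
R_{\ell+1} = B(u_\ell, Q_\ell) - (\mathcal{R}_\ell \cup \mathcal{B}_\ell).
\]

Therefore,

\[
\partial R_{\ell+1} \subset \bigl(B(u_\ell, Q_\ell + 1) - B(u_\ell, Q_\ell)\bigr) \cup \mathcal{R}_\ell \cup \mathcal{B}_\ell.
\]

Again, by definition of the algorithm we have

\[
B_{\ell+1} = B(u_\ell, Q_\ell + 1) - B(u_\ell, Q_\ell) - \mathcal{R}_\ell - \mathcal{B}_\ell.
\]

Therefore, $v \in B_{\ell+1}$ or $v \in \mathcal{R}_\ell \cup \mathcal{B}_\ell$. Recall that by definition of the algorithm $\mathcal{B}_\ell \cap \mathcal{R}_\ell = \emptyset$. Since we have assumed that $v \notin \mathcal{B}_{\ell+1}$, it must be that $v \in \mathcal{R}_\ell$. That is, there exists $\ell' \le \ell$ such that $v \in R_{\ell'}$. Now since $v \in \partial R_{\ell+1}$ by assumption, it must be that there exists $v' \in R_{\ell+1}$ such that $d_G(v,v') = 1$. Since by definition $R_{\ell+1} \cap R_{\ell'} = \emptyset$, we have $v' \in \partial R_{\ell'}$. By the induction hypothesis, this implies that $v' \in \mathcal{B}_{\ell'} \subset \mathcal{B}_\ell$. That is, $\mathcal{B}_\ell \cap R_{\ell+1} \neq \emptyset$, which is a contradiction to the definition of our algorithm. That is, our assumption that $\partial R_{\ell+1} \not\subset \mathcal{B}_{\ell+1}$ is false. Thus, we have established the inductive step. This completes the induction argument and the proof of Claim 2. ■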

Now when the algorithm terminates (which must happen within $n$ iterations), say the output set is $\mathcal{B}_T$ and $V - \mathcal{B}_T = \mathcal{R}_T$ for some $T$. As noted above, $\mathcal{R}_T$ is a union of disjoint sets $R_1, \ldots, R_T$. We want to show that $R_i, R_j$ are disconnected for any $1 \le i < j \le T$ using Claim 2. Suppose to the contrary that they are connected. That is, there exist $v \in R_i$ and $v' \in R_j$ such that $d_G(v, v') = 1$. Since $R_i \cap R_j = \emptyset$, it must be that $v' \in \partial R_i$, $v \in \partial R_j$. From Claim 2 and the fact that $\mathcal{B}_t \subset \mathcal{B}_{t+1}$ for all $t$, we have that $R_i \cap \mathcal{B}_T \neq \emptyset$ and $R_j \cap \mathcal{B}_T \neq \emptyset$. This is contrary to the definition of the algorithm. Thus, we have established that $R_1, \ldots, R_T$ are disconnected components whose union is $V - \mathcal{B}_T$. By definition, each $R_i \subset B(u_{i-1}, K)$. Thus, we have established that $V - \mathcal{B}_T$ is made of connected components, each of which is contained inside a ball of radius $K$ with respect to $d_G$. Since $G$ has doubling dimension $\rho$, Lemma 1 implies that the size of any ball of radius $K$ is at most $(2K)^{\rho}$. Given the choice of $\epsilon < 1$ and $\rho \ge 1$, we have that $K \ge 2$. Therefore, $(2K)^{\rho} \le K^{2\rho}$. This completes the proof of (b) and that of Lemma 2. ■



C. Minor-excluded graphs 

Here we describe a simple and explicit construction of decomposition for graphs that exclude certain finite-sized graphs as their minor. This scheme is a direct adaptation of a scheme proposed by Klein, Plotkin, Rao [15] and Rao [16]. We describe an $(\epsilon, \Delta)$ node-decomposition scheme. Later, we describe how it can be modified to obtain an $(\epsilon, \Delta)$ edge-decomposition.

Suppose we are given a graph $G$ that excludes graph $K_{r,r}$ as a minor. Recall that if a graph excludes some graph $G_r$ of $r$ nodes as its minor then it excludes $K_{r,r}$ as its minor as well. In what follows and the rest of the paper, we will always assume $r$ to be some finite number that does not scale with $n$ (the number of nodes in $G$). The following algorithm for generating node-decomposition uses parameter $\Delta$. Later we shall relate the parameter $\Delta$ to the decomposition property of the output.

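MINOR-V($G, r, \Delta$)

(0) Input is graph $G = (V, E)$ and $r, \Delta \in \mathbb{N}$. Initially, $i = 0$, $G_0 = G$, $\mathcal{B} = \emptyset$.

(1) For $i = 0, \ldots, r-1$, do the following.

(a) Let $S^i_1, \ldots, S^i_{k_i}$ be the connected components of $G_i$.

(b) For each $S^i_j$, $1 \le j \le k_i$, pick an arbitrary node $v_j \in S^i_j$.

o Create a breadth-first search tree $T^i_j$ rooted at $v_j$ in $S^i_j$.
o Choose a number $L^i_j$ uniformly at random from $\{0, \ldots, \Delta - 1\}$.
o Let $\mathcal{B}^i_j$ be the set of nodes at level $L^i_j, \Delta + L^i_j, 2\Delta + L^i_j, \ldots$ in $T^i_j$.
o Update $\mathcal{B} = \mathcal{B} \cup \bigcup_{j=1}^{k_i} \mathcal{B}^i_j$.

(c) Set $i = i + 1$.

(3) Output $\mathcal{B}$ and graph $G' = (V, E \backslash \mathcal{B})$.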



As stated above, the basic idea is to use the following step recursively (up to depth $r$ of recursion): in each connected component, say $S$, choose a node arbitrarily and create a breadth-first search tree, say $T$. Choose a number, say $L$, uniformly at random from $\{0, \ldots, \Delta - 1\}$. Remove (and add to $\mathcal{B}$) all nodes that are at level $L + k\Delta$, $k \ge 0$, in $T$. Clearly, the total running time of such an algorithm is $O(r(n + |E|))$ for a graph $G = (V,E)$ with $|V| = n$, with possible parallel implementation across different connected components.
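The recursive step just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the graph is an adjacency dictionary, the "arbitrary" root is taken to be the first node of each component, and the names `components`, `bfs_levels` and `minor_v` are hypothetical.

```python
import random
from collections import deque


def components(nodes, adj):
    """Connected components of the subgraph induced on the set `nodes`."""
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [], deque([start])
        while queue:
            u = queue.popleft()
            comp.append(u)
            for w in adj[u]:
                if w in nodes and w not in seen:
                    seen.add(w)
                    queue.append(w)
        comps.append(comp)
    return comps


def bfs_levels(root, comp, adj):
    """Breadth-first level of every node of `comp`, measured from `root`."""
    allowed = set(comp)
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in allowed and w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    return level


def minor_v(adj, r, delta, rng):
    """r rounds; in each round every component loses the BFS levels
    congruent to a random offset modulo delta."""
    remaining = set(adj)
    removed = set()
    for _ in range(r):
        for comp in components(remaining, adj):
            root = comp[0]                 # assumption: any node would do
            offset = rng.randrange(delta)  # uniform on {0, ..., delta - 1}
            level = bfs_levels(root, comp, adj)
            removed |= {v for v, lv in level.items() if lv % delta == offset}
        remaining -= removed
    return removed, components(remaining, adj)
```

For a path of 9 nodes, `minor_v(adj, 2, 3, random.Random(0))` returns the removed node set together with the surviving components; in this sketch the BFS root of the first round is an endpoint of the path, so each surviving component has at most 2 nodes.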

Figure 1 explains the algorithm for a line-graph of $n = 9$ nodes, which excludes $K_{2,2}$ as a minor. The example is about a sample run of MINOR-V($G, 2, 3$) (Figure 1 shows the first iteration of the algorithm).

The following is the result that was in essence proved in 
[15], [16]. 

Lemma 3: Suppose $G$ excludes $K_{r,r}$ as a minor. Let $\mathcal{B}$ be the output of MINOR-V($G, r, \Delta$). Then each connected component of $V \backslash \mathcal{B}$ has diameter of size $O(\Delta)$.

Proof: This Lemma, for $r = 3$, was proved by Rao in [16] (Lemma 5 and Corollary 6 of [16]). The result is based on Theorem 4.2 of [15], which holds for any $r$. Therefore, the result of Rao naturally extends for any $r$. This completes the justification of Lemma 3. ■

Fig. 1. The first of two iterations in execution of MINOR-V($G, 2, 3$) is shown.

Now using Lemma 3 we obtain the following Lemma.

Lemma 4: Suppose $G$ excludes $K_{r,r}$ as a minor. Let $d^*$ be the maximum vertex degree of nodes in $G$. Then algorithm MINOR-V($G, r, \Delta$) outputs $\mathcal{B}$ which is an $(r/\Delta, d^{*O(\Delta)})$ node-decomposition of $G$.

Proof: Let $R$ be a connected component of $V \backslash \mathcal{B}$. From Lemma 3 the diameter of $R$ is $O(\Delta)$. Since $d^*$ is the maximum vertex degree of nodes of $G$, the number of nodes in $R$ is bounded above by $d^{*O(\Delta)}$.

To show that $\Pr(v \in \mathcal{B}) \le r/\Delta$, consider a vertex $v \in V$. If $v \notin \mathcal{B}$ in the beginning of an iteration $0 \le i \le r-1$, then it will be present in exactly one breadth-first search tree, say $T^i_j$. This vertex $v$ will be chosen in $\mathcal{B}^i_j$ only if it is at level $k\Delta + L^i_j$ for some integer $k \ge 0$. The probability of this event is at most $1/\Delta$ since $L^i_j$ is chosen uniformly at random from $\{0, 1, \ldots, \Delta - 1\}$. By the union bound, it follows that the probability that a vertex is chosen to be in $\mathcal{B}$ in any of the $r$ iterations is at most $r/\Delta$. This completes the proof of Lemma 4. ■

It is known that a planar graph excludes $K_{3,3}$ as a minor. Hence, Lemma 4 implies the following.

Corollary 2: Given a planar graph $G$ with maximum vertex degree $d^*$, the algorithm MINOR-V($G, 3, \Delta$) produces a $(3/\Delta, d^{*O(\Delta)})$ node-decomposition for any $\Delta \ge 1$.

We describe a slight modification of MINOR-V to obtain an algorithm that produces an edge-decomposition as follows. Note that the only change compared to MINOR-V is the selection of edges rather than vertices to create the decomposition.

MINOR-E($G, r, \Delta$)

(0) Input is graph $G = (V, E)$ and $r, \Delta \in \mathbb{N}$. Initially, $i = 0$, $G_0 = G$, $\mathcal{B} = \emptyset$.

(1) For $i = 0, \ldots, r-1$, do the following.

(a) Let $S^i_1, \ldots, S^i_{k_i}$ be the connected components of $G_i$.

(b) For each $S^i_j$, $1 \le j \le k_i$, pick an arbitrary node $v_j \in S^i_j$.

o Create a breadth-first search tree $T^i_j$ rooted at $v_j$ in $S^i_j$.
o Choose a number $L^i_j$ uniformly at random from $\{0, \ldots, \Delta - 1\}$.
o Let $\mathcal{B}^i_j$ be the set of edges at level $L^i_j, \Delta + L^i_j, 2\Delta + L^i_j, \ldots$ in $T^i_j$.
o Update $\mathcal{B} = \mathcal{B} \cup \bigcup_{j=1}^{k_i} \mathcal{B}^i_j$.

(c) Set $i = i + 1$.

(3) Output $\mathcal{B}$ and graph $G' = (V, E \backslash \mathcal{B})$.

Lemma 5: Suppose $G$ excludes $K_{r,r}$ as a minor. Let $d^*$ be the maximum vertex degree of nodes in $G$. Then algorithm MINOR-E($G, r, \Delta$) outputs $\mathcal{B}$ which is an $(r/\Delta, d^{*O(\Delta)})$ edge-decomposition of $G$.

Proof: Let $G^*$ be a graph that is obtained from $G$ by adding a center vertex to each edge of $G$. It is easy to see that if $G$ excludes $K_{r,r}$ as a minor then so does $G^*$.

Now the algorithm MINOR-E($G, r, \Delta$) can be viewed as executing MINOR-V($G^*, r, 2\Delta - 1$) with the modification that the random numbers $L^i_j$ are chosen uniformly at random from $\{1, 3, 5, \ldots, 2\Delta - 1\}$ instead of the whole support $\{1, 2, \ldots, 2\Delta - 1\}$. To prove Lemma 5 we need to show that: (a) each edge is part of the output set $\mathcal{B}$ with probability at most $r/\Delta$, and (b) the size of each connected component of $V \backslash \mathcal{B}$ is at most $d^{*O(\Delta)}$.

Part (a) follows from exactly the same arguments as those used in Lemma 4. For (b), consider the following. Lemma 3 implies that if the algorithm was executed with the random numbers $L^i_j$ being chosen from $\{1, 2, \ldots, 2\Delta - 1\}$, then the desired result follows with probability 1. It is easy to see that under the execution of the algorithm with these choices for random numbers, with strictly positive probability (independent of $n$) all the $L^i_j$ are chosen only from the odd numbers, i.e. $\{1, 3, 5, \ldots, 2\Delta - 1\}$. Therefore, it must be that when we restrict the choice of numbers to these odd numbers, the algorithm must produce the desired result. This completes the proof of Lemma 5. ■

Figure 2 explains the algorithm for a line-graph of $n = 9$ nodes, which excludes $K_{2,2}$ as a minor. The example is about a sample run of MINOR-E($G, 2, 3$) (Figure 2 shows the first iteration of the algorithm).

IV. Approximate log Z 

Here, we describe an algorithm for approximate computation of $\log Z$ for any graph $G$. The algorithm uses an edge-decomposition algorithm as a sub-routine. Our algorithm provides provable upper and lower bounds on $\log Z$ for any graph $G$. In order to obtain tight approximation guarantees, we will use specific graph structures, namely low doubling dimension and minor-excluded graphs.

A. Algorithm

In what follows, we use the term DECOMP for a generic edge-decomposition algorithm. The approximation guarantee of the output of the algorithm and its computation time depend on the property of DECOMP. For a graph with low doubling dimension, we use algorithm DB-DIM (over the edge graph) and for a graph that excludes $K_{r,r}$ as a minor for some $r$, we use algorithm MINOR-E.



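LOG PARTITION($G$)

(1) Use DECOMP($G$) to obtain $\mathcal{B} \subset E$ such that

(a) $G' = (V, E \backslash \mathcal{B})$ is made of connected components $S_1, \ldots, S_K$.

(2) For each connected component $S_j$, $1 \le j \le K$, do the following:

(a) Compute partition function $Z_j$ restricted to $S_j$ by dynamic programming (or exhaustive computation).

(3) Let $\psi^L_{ij} = \min_{(x,x') \in \Sigma^2} \psi_{ij}(x,x')$ and $\psi^U_{ij} = \max_{(x,x') \in \Sigma^2} \psi_{ij}(x,x')$. Then

\[
\log \hat{Z}_{LB} = \sum_{j=1}^{K} \log Z_j + \sum_{(i,j) \in \mathcal{B}} \psi^L_{ij}, \qquad
\log \hat{Z}_{UB} = \sum_{j=1}^{K} \log Z_j + \sum_{(i,j) \in \mathcal{B}} \psi^U_{ij}.
\]

(4) Output: lower bound $\log \hat{Z}_{LB}$ and upper bound $\log \hat{Z}_{UB}$.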

In words, LOG PARTITION($G$) produces upper and lower bounds on $\log Z$ of MRF $G$ as follows: decompose graph $G$ into (small) components $S_1, \ldots, S_K$ by removing (few) edges $\mathcal{B} \subset E$ using DECOMP($G$). Compute the exact log-partition function in each of the components. To produce bounds $\log \hat{Z}_{LB}, \log \hat{Z}_{UB}$, take the summation of the thus computed component-wise log-partition functions along with the minimal and maximal effect of edges from $\mathcal{B}$.
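The procedure above can be illustrated with a small brute-force sketch. It assumes the model $\Pr[x] \propto \exp(\sum_i \phi_i(x_i) + \sum_{(i,j)} \psi_{ij}(x_i,x_j))$ with values in $\{0,\ldots,k-1\}$, takes the removed edge set as given (no particular DECOMP is implemented), and uses exhaustive enumeration, so it is only usable for tiny components. All function names are hypothetical.

```python
import itertools
import math


def connected_components(nodes, edges):
    """Connected components of the graph (nodes, edges)."""
    adj = {v: set() for v in nodes}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def log_z(nodes, edges, phi, psi, k):
    """Exact log-partition function by enumerating all k^|nodes| assignments."""
    total = 0.0
    for values in itertools.product(range(k), repeat=len(nodes)):
        x = dict(zip(nodes, values))
        h = sum(phi[i][x[i]] for i in nodes)
        h += sum(psi[(i, j)][x[i]][x[j]] for (i, j) in edges)
        total += math.exp(h)
    return math.log(total)


def log_partition_bounds(nodes, edges, phi, psi, k, removed):
    """Lower and upper bounds on log Z after deleting the edges in `removed`."""
    kept = [e for e in edges if e not in removed]
    base = 0.0
    for comp in connected_components(nodes, kept):
        members = set(comp)
        inner = [e for e in kept if e[0] in members]
        base += log_z(comp, inner, phi, psi, k)
    # Each removed edge contributes its smallest / largest potential value.
    lower = base + sum(min(map(min, psi[e])) for e in removed)
    upper = base + sum(max(map(max, psi[e])) for e in removed)
    return lower, upper
```

On a 4-cycle with two opposite edges removed, the two returned numbers bracket the exact `log_z` value, and their difference equals the total spread of the removed edge potentials.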



B. Analysis of LOG PARTITION: General G

Here, we analyze the performance of LOG PARTITION for any $G$. Later, we will use properties of the specific graph structure to obtain sharper approximation guarantees.

Theorem 1: Given a pair-wise MRF $G$, LOG PARTITION produces $\log \hat{Z}_{LB}, \log \hat{Z}_{UB}$ such that

\[
\log \hat{Z}_{LB} \le \log Z \le \log \hat{Z}_{UB}, \qquad
\log \hat{Z}_{UB} - \log \hat{Z}_{LB} = \sum_{(i,j) \in \mathcal{B}} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr).
\]

It takes $O\bigl(|E|\,|\Sigma|^{|S^*|}\bigr) + T_{\mathrm{DECOMP}}$ time to produce this estimate, where $|S^*| = \max_{j=1}^{K} |S_j|$ with DECOMP producing the decomposition of $G$ into $S_1, \ldots, S_K$ in time $T_{\mathrm{DECOMP}}$.

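Proof: First, we prove properties of $\log \hat{Z}_{LB}, \log \hat{Z}_{UB}$ as follows:

\[
\begin{aligned}
\log \hat{Z}_{LB} &= \sum_{j=1}^{K} \log Z_j + \sum_{(i,j) \in \mathcal{B}} \psi^L_{ij} \\
&\overset{(a)}{=} \log \Bigl[ \sum_{x \in \Sigma^n} \exp\Bigl( \sum_{i \in V} \phi_i(x_i) + \sum_{(i,j) \in E \backslash \mathcal{B}} \psi_{ij}(x_i,x_j) + \sum_{(i,j) \in \mathcal{B}} \psi^L_{ij} \Bigr) \Bigr] \\
&\overset{(b)}{\le} \log \Bigl[ \sum_{x \in \Sigma^n} \exp\Bigl( \sum_{i \in V} \phi_i(x_i) + \sum_{(i,j) \in E \backslash \mathcal{B}} \psi_{ij}(x_i,x_j) + \sum_{(i,j) \in \mathcal{B}} \psi_{ij}(x_i,x_j) \Bigr) \Bigr] \\
&= \log Z \\
&\overset{(c)}{\le} \log \Bigl[ \sum_{x \in \Sigma^n} \exp\Bigl( \sum_{i \in V} \phi_i(x_i) + \sum_{(i,j) \in E \backslash \mathcal{B}} \psi_{ij}(x_i,x_j) + \sum_{(i,j) \in \mathcal{B}} \psi^U_{ij} \Bigr) \Bigr] \\
&\overset{(d)}{=} \sum_{j=1}^{K} \log Z_j + \sum_{(i,j) \in \mathcal{B}} \psi^U_{ij} = \log \hat{Z}_{UB}.
\end{aligned}
\]

We justify (a)-(d) as follows: (a) holds because by removal of edges $\mathcal{B}$, the graph $G$ decomposes into disjoint connected components $S_1, \ldots, S_K$; (b) holds because of the definition of $\psi^L_{ij}$; (c) holds by definition of $\psi^U_{ij}$; and (d) holds for a similar reason as (a). The claim about the difference $\log \hat{Z}_{UB} - \log \hat{Z}_{LB}$ in the statement of Theorem 1 follows directly from definitions (i.e. subtract the RHS of (a) from (d)). This completes the proof of the claimed relation between bounds $\log \hat{Z}_{LB}, \log \hat{Z}_{UB}$.

For the running time analysis, note that LOG PARTITION performs two main tasks: (i) decomposing $G$ using the DECOMP algorithm, which by definition takes $T_{\mathrm{DECOMP}}$ time; (ii) computing $Z_j$ for each component $S_j$ through exhaustive computation, which takes $O(|E_j|\,|\Sigma|^{|S_j|})$ time (where $E_j$ are edges between nodes of $S_j$), and producing $\log \hat{Z}_{LB}, \log \hat{Z}_{UB}$, which takes an additional $|E|$ operations at the most. Now, the maximum size among these components is $|S^*|$. Further, $\cup_j E_j \subset E$. Therefore, we obtain that the total running time for this task is $O(|E|\,|\Sigma|^{|S^*|})$. Putting (i) and (ii) together, we obtain the desired bound. This completes the proof of Theorem 1. ■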



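C. Some preliminaries

Before stating the precise approximation bound of the LOG PARTITION algorithm for graphs with low doubling dimension and graphs that exclude minors, we state two useful Lemmas about $\log Z$ for any graph.

Lemma 6: If $G$ has maximum vertex degree $d^*$ then,

\[
\log Z \ge \frac{1}{d^*+1} \sum_{(i,j) \in E} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr).
\]

Proof: Assign weight $w_{ij} = \psi^U_{ij} - \psi^L_{ij}$ to an edge $(i,j) \in E$. Since the graph has maximum vertex degree $d^*$, by Vizing's theorem there exists an edge-coloring of the graph using at most $d^* + 1$ colors. Edges with the same color form a matching of $G$. A standard application of the pigeonhole principle implies that there is a color with weight at least $\frac{1}{d^*+1} \sum_{(i,j) \in E} w_{ij}$. Let $M \subset E$ denote this set of edges. That is,

\[
\sum_{(i,j) \in M} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr) \ge \frac{1}{d^*+1} \sum_{(i,j) \in E} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr).
\]

Now, consider a $Q \subset \Sigma^n$ of size $2^{|M|}$ created as follows. For $(i,j) \in M$ let $(x^U_i, x^U_j) \in \arg\max_{(x,x') \in \Sigma^2} \psi_{ij}(x,x')$. For each $i \in V$, choose $x^L_i \in \Sigma$ arbitrarily. Then,

\[
Q = \bigl\{ x \in \Sigma^n : \forall\, (i,j) \in M,\ (x_i, x_j) = (x^U_i, x^U_j) \text{ or } (x^L_i, x^L_j);\ \text{for all other } i \in V,\ x_i = x^L_i \bigr\}.
\]

Note that we have used the fact that $M$ is a matching for $Q$ to be well-defined.

By definition $\phi_i, \psi_{ij}$ are non-negative functions (hence, their exponents are at least 1). Using this property, we have the following:

\[
\begin{aligned}
Z &\ge \sum_{x \in Q} \exp\Bigl( \sum_{i \in V} \phi_i(x_i) + \sum_{(i,j) \in E} \psi_{ij}(x_i,x_j) \Bigr) \\
&\overset{(a)}{\ge} \sum_{x \in Q} \exp\Bigl( \sum_{(i,j) \in M} \psi_{ij}(x_i,x_j) \Bigr) \\
&\ge 2^{|M|} \prod_{(i,j) \in M} \frac{\exp(\psi^L_{ij}) + \exp(\psi^U_{ij})}{2} \\
&\overset{(b)}{=} \prod_{(i,j) \in M} \exp(\psi^L_{ij}) \bigl(1 + \exp(\psi^U_{ij} - \psi^L_{ij})\bigr) \\
&\overset{(c)}{\ge} \exp\Bigl( \sum_{(i,j) \in M} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr) \Bigr).
\end{aligned}
\tag{11}
\]

Justification of (a)-(c): (a) follows since $\psi_{ij}, \phi_i$ are non-negative functions. For the inequality that follows (a), consider the following probabilistic experiment: assign $(x_i, x_j)$ for each $(i,j) \in M$ equal to $(x^U_i, x^U_j)$ or $(x^L_i, x^L_j)$ with probability $1/2$ each. Under this experiment, the expected value of $\exp\bigl(\sum_{(i,j) \in M} \psi_{ij}(x_i,x_j)\bigr)$, which is $\prod_{(i,j) \in M} \frac{\exp(\psi_{ij}(x^L_i,x^L_j)) + \exp(\psi_{ij}(x^U_i,x^U_j))}{2}$, is equal to $2^{-|M|} \bigl[\sum_{x \in Q} \exp\bigl(\sum_{(i,j) \in M} \psi_{ij}(x_i,x_j)\bigr)\bigr]$. Now, use the fact that $\psi_{ij}(x^L_i, x^L_j) \ge \psi^L_{ij}$; (b) follows from simple algebra and (c) follows by using non-negativity of the function $\psi_{ij}$. Therefore,

\[
\log Z \ge \sum_{(i,j) \in M} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr) \ge \frac{1}{d^*+1} \sum_{(i,j) \in E} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr),
\tag{12}
\]

using the fact about the weight of $M$. This completes the proof of Lemma 6. ■

Lemma 7: If $G$ has maximum vertex degree $d^*$ and DECOMP($G$) produces $\mathcal{B}$ that is an $(\epsilon, \Delta)$ edge-decomposition, then

\[
E\bigl[\log \hat{Z}_{UB} - \log \hat{Z}_{LB}\bigr] \le \epsilon (d^* + 1) \log Z,
\]

w.r.t. the randomness in $\mathcal{B}$, and LOG PARTITION takes time $O(n d^* |\Sigma|^{\Delta}) + T_{\mathrm{DECOMP}}$.

Proof: From Theorem 1, Lemma 6 and the definition of $(\epsilon, \Delta)$ edge-decomposition, we have the following.

\[
\begin{aligned}
E\bigl[\log \hat{Z}_{UB} - \log \hat{Z}_{LB}\bigr] &\le E\Bigl[ \sum_{(i,j) \in \mathcal{B}} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr) \Bigr] \\
&= \sum_{(i,j) \in E} \Pr\bigl((i,j) \in \mathcal{B}\bigr) \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr) \\
&\le \epsilon \sum_{(i,j) \in E} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr) \\
&\le \epsilon (d^* + 1) \log Z.
\end{aligned}
\]

Now to estimate the running time, note that under an $(\epsilon, \Delta)$ decomposition $\mathcal{B}$, with probability 1 the graph $G' = (V, E \backslash \mathcal{B})$ is divided into connected components with at most $\Delta$ nodes. Therefore, the running time bound of Theorem 1 implies the desired result. ■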
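As a numerical sanity check of Lemma 7 (not a proof), the sketch below uses a toy random decomposition of a cycle that is purely an assumption for illustration: an offset is drawn uniformly from $\{0,\ldots,\Delta-1\}$ and every edge whose index is congruent to the offset modulo $\Delta$ is removed, so each edge is removed with probability $1/\Delta$ when $\Delta$ divides the number of edges. Since Theorem 1 gives the gap exactly as the spread of the removed potentials, the expectation can be computed by enumerating the offsets. The function names are hypothetical.

```python
import itertools
import math


def log_z(nodes, edges, phi, psi, k):
    """Exact log-partition function by exhaustive enumeration."""
    total = 0.0
    for values in itertools.product(range(k), repeat=len(nodes)):
        x = dict(zip(nodes, values))
        h = sum(phi[i][x[i]] for i in nodes)
        h += sum(psi[(i, j)][x[i]][x[j]] for (i, j) in edges)
        total += math.exp(h)
    return math.log(total)


def spread(table):
    """psi^U - psi^L for one pairwise potential table."""
    flat = [v for row in table for v in row]
    return max(flat) - min(flat)


def expected_gap(edges, psi, delta):
    """E[log Z_UB - log Z_LB] under the toy offset rule, averaging the
    removed-edge spread over the delta equally likely offsets."""
    total = 0.0
    for offset in range(delta):
        removed = [e for idx, e in enumerate(edges) if idx % delta == offset]
        total += sum(spread(psi[e]) for e in removed)
    return total / delta
```

For a 6-cycle with `delta = 3`, every edge is removed with probability $\epsilon = 1/3$ and $d^* = 2$, so Lemma 7 predicts that `expected_gap(...)` is at most $\epsilon(d^*+1)\log Z = \log Z$.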

D. Analysis of LOG PARTITION: Low doubling-dimension G

Here we interpret the results obtained in Theorem 1 and Lemma 7 for $G$ that has low doubling-dimension and uses decomposition scheme DB-DIM.

Theorem 2: Let MRF graph $G$ of $n$ nodes with doubling dimension $\rho$ be given. Consider any $\epsilon \in (0, 1)$. Define $\varphi = \epsilon 2^{-\rho-3}$. Then LOG PARTITION using DB-DIM($\varphi, K(\varphi, \rho)$) produces bounds $\log \hat{Z}_{LB}, \log \hat{Z}_{UB}$ such that

\[
E\bigl[\log \hat{Z}_{UB} - \log \hat{Z}_{LB}\bigr] \le \epsilon \log Z.
\]

The algorithm takes $O(n 2^{\rho} C_0(\epsilon, \rho))$ time to obtain the estimate, where $C_0(\epsilon, \rho) = |\Sigma|^{K(\varphi,\rho)^{2\rho}}$. Further, if $\rho(\rho + \log 1/\epsilon) = o(\log\log n)$ then the algorithm takes $o(n^{1+\delta})$ amount of time for any $\delta > 0$.

Proof: The Lemma |2l Lemma |7] and Theorem [T] implies 
the following bound: 



E 



logZuB -log^LeJ < 62- P-^d* + 1) log Z 

< elogZ. (13) 



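Now for a graph with doubling dimension $\rho$, $|E| \le n 2^{\rho}$. Under the decomposition algorithm with parameter $\varphi$ and $K(\varphi, \rho)$, the number of nodes in any component is at most $K(\varphi, \rho)^{2\rho}$. Therefore, by Lemma 2 the desired bound on running time follows.

Now, consider the case when the condition $\rho(\rho + \log 1/\epsilon) = o(\log\log n)$ holds. Given $\varphi = \epsilon 2^{-\rho-3}$,

\[
\begin{aligned}
\rho \log \rho/\varphi &= \rho(\log \rho + \rho + 3 + \log 1/\epsilon) \\
&= \Theta(\rho^2 + \rho \log 1/\epsilon) \\
&= o(\log\log n),
\end{aligned}
\tag{14}
\]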



from the above described hypothesis of the Theorem. Now, 
DB-DlM{(p,K{(p,p)) produces (e2"''"^, 0(log^/'^ n)) edge- 
decomposition from Corollary [T] We select L = 2. Given 
this and above arguments, we have that the running time of 
the algorithm is o(ri^+*)) for any (5 > 0. This completes the 
proof of Theorem |2] ■ 

E. Analysis of LOG PARTITION: Minor-excluded G

We apply Theorem 1 and Lemma 7 for minor-excluded graphs when the DECOMP procedure is essentially MINOR-E. We obtain the following precise result.

Theorem 3: Let MRF graph $G$ of $n$ nodes exclude $K_{r,r}$ as its minor. Let $d^*$ be the maximum vertex degree in $G$. Given $\epsilon > 0$, use the LOG PARTITION algorithm with MINOR-E($G, r, \Delta$) where $\Delta = \lceil r(d^*+1)/\epsilon \rceil$. Then,

\[
\log \hat{Z}_{LB} \le \log Z \le \log \hat{Z}_{UB}; \quad\text{and}\quad
E\bigl[\log \hat{Z}_{UB} - \log \hat{Z}_{LB}\bigr] \le \epsilon \log Z.
\]

Further, the algorithm takes $O(n\,C(d^*, |\Sigma|, \epsilon))$ time, where the constant $C(d^*, |\Sigma|, \epsilon) = d^* |\Sigma|^{d^{*O(\Delta)}}$. Therefore, if $\epsilon^{-1} d^* \log d^* = o(\log\log n)$, then the algorithm takes $o(n^{1+\delta})$ steps for arbitrary $\delta > 0$.

Proof: From Lemma 5 about the MINOR-E algorithm, we have that with the choice of $\Delta = \lceil r(d^*+1)/\epsilon \rceil$, the algorithm produces an $(r/\Delta, \Lambda)$ edge-decomposition with $r/\Delta \le \epsilon/(d^*+1)$ and $\Lambda = d^{*O(\Delta)}$. Therefore, by Lemma 7, the upper bound and the lower bound, $\log \hat{Z}_{UB}, \log \hat{Z}_{LB}$, produced by the algorithm are within $\epsilon \log Z$ of each other in expectation.

Now, by Lemma 7 the running time of the algorithm is $O(n d^* |\Sigma|^{\Lambda}) + T_{\mathrm{DECOMP}}$. As discussed earlier in Lemma 5, the algorithm MINOR-E takes $O(r|E|) = O(n r d^*)$ operations. That is, $T_{\mathrm{DECOMP}} = O(n r d^*)$. Now, $\Lambda = d^{*O(\Delta)}$ and $\Delta \le r(d^* + 1)/\epsilon + 1$. Therefore, the first term of the computation time bound is bounded above by

\[
O\Bigl( n d^* |\Sigma|^{d^{*O(r d^*/\epsilon)}} \Bigr).
\]

Now, we will establish that the above term is $o(n^{1+\delta})$ under the hypothesis $\epsilon^{-1} d^* \log d^* = o(\log\log n)$. The hypothesis implies that (since $r$ is a constant, not scaling with $n$):

\[
\Delta \log d^* = o(\log\log n).
\]

That is, for any finite $L$ (say, $L = 2$) we have that

\[
\Lambda = O(\log^{1/L} n).
\]

This in turn implies that, for finite $|\Sigma|$ we have

\[
|\Sigma|^{\Lambda} = o(n^{\delta/2}),
\]

for any $\delta > 0$. Further, $d^* = o(\log\log n) = O(n^{\delta/2})$. Therefore, it follows that

\[
n d^* |\Sigma|^{\Lambda} = o(n^{1+\delta}).
\]

This completes the proof of Theorem 3. ■



V. Approximate MAP 

Now, we describe an algorithm to compute MAP approximately. It is very similar to the LOG PARTITION algorithm: given $G$, decompose it into (small) components $S_1, \ldots, S_K$ by removing (few) edges $\mathcal{B} \subset E$. Then, compute an approximate MAP assignment by computing the exact MAP restricted to the components. As in LOG PARTITION, the computation time and performance of the algorithm depend on the property of the decomposition scheme. We describe the algorithm for any graph $G$; it will be specialized for graphs with low doubling dimension and graphs that exclude a minor by using the appropriate edge-decomposition schemes.

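MODE($G$)

(0) Input is MRF $G = (V,E)$ with $\phi_i(\cdot), i \in V$, and $\psi_{ij}(\cdot,\cdot), (i,j) \in E$.

(1) Use DECOMP($G$) to obtain $\mathcal{B} \subset E$ such that

(a) $G' = (V, E \backslash \mathcal{B})$ is made of connected components $S_1, \ldots, S_K$.

(2) For each connected component $S_j$, $1 \le j \le K$, do the following:

(a) Through dynamic programming (or exhaustive computation) find the exact MAP $x^{*,j}$ for component $S_j$, where $x^{*,j} = (x^{*,j}_i)_{i \in S_j}$.

(3) Produce output $\hat{x}^*$, which is obtained by assigning values to nodes using $x^{*,j}$, $1 \le j \le K$.

A. Analysis of MODE: General G

Here, we analyze the performance of MODE for any $G$. Later, we will specialize our analysis for graphs with low doubling dimension and minor-excluded graphs.

Theorem 4: Given an MRF $G$ described by (1), the MODE algorithm produces output $\hat{x}^*$ such that:

\[
\mathcal{H}(x^*) - \sum_{(i,j) \in \mathcal{B}} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr) \le \mathcal{H}(\hat{x}^*) \le \mathcal{H}(x^*).
\]

The algorithm takes $O\bigl(|E| K |\Sigma|^{|S^*|}\bigr) + T_{\mathrm{DECOMP}}$ time to produce this estimate, where $|S^*| = \max_{j=1}^{K} |S_j|$ with DECOMP producing the decomposition of $G$ into $S_1, \ldots, S_K$ in time $T_{\mathrm{DECOMP}}$.

Proof: By definition of MAP $x^*$, we have $\mathcal{H}(\hat{x}^*) \le \mathcal{H}(x^*)$. Now, consider the following.

\[
\begin{aligned}
\mathcal{H}(x^*) &= \max_{x \in \Sigma^n} \Bigl[ \sum_{i \in V} \phi_i(x_i) + \sum_{(i,j) \in E} \psi_{ij}(x_i,x_j) \Bigr] \\
&\overset{(a)}{\le} \max_{x \in \Sigma^n} \Bigl[ \sum_{i \in V} \phi_i(x_i) + \sum_{(i,j) \in E \backslash \mathcal{B}} \psi_{ij}(x_i,x_j) + \sum_{(i,j) \in \mathcal{B}} \psi^U_{ij} \Bigr] \\
&\overset{(b)}{=} \sum_{j=1}^{K} \Bigl[ \max_{x^j \in \Sigma^{|S_j|}} \mathcal{H}(x^j) \Bigr] + \sum_{(i,j) \in \mathcal{B}} \psi^U_{ij} \\
&\overset{(c)}{=} \sum_{j=1}^{K} \mathcal{H}(x^{*,j}) + \sum_{(i,j) \in \mathcal{B}} \psi^U_{ij} \\
&\overset{(d)}{\le} \mathcal{H}(\hat{x}^*) + \sum_{(i,j) \in \mathcal{B}} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr).
\end{aligned}
\tag{15}
\]

We justify (a)-(d) as follows: (a) holds because for each edge $(i,j) \in \mathcal{B}$, we have replaced its effect by the maximal value $\psi^U_{ij}$; (b) holds because by placing the constant value $\psi^U_{ij}$ over $(i,j) \in \mathcal{B}$, the maximization over $G$ decomposes into maximization over the connected components of $G' = (V, E \backslash \mathcal{B})$; (c) holds by definition of $x^{*,j}$; and (d) holds because when we obtain the global assignment $\hat{x}^*$ from $x^{*,j}$, $1 \le j \le K$, and compute its global value, the additional terms get added for each $(i,j) \in \mathcal{B}$, which add at least $\psi^L_{ij}$ amount.

The running time analysis of MODE is exactly the same as that of LOG PARTITION in Theorem 1. Hence, we skip the details here. This completes the proof of Theorem 4. ■

B. Some preliminaries

This section presents some results about the property of the MAP solution that will be useful in obtaining tight approximation guarantees later. First, consider the following.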

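Lemma 8: If $G$ has maximum vertex degree $d^*$, then

\[
\mathcal{H}(x^*) \ge \frac{1}{d^*+1} \sum_{(i,j) \in E} \psi^U_{ij} \ge \frac{1}{d^*+1} \sum_{(i,j) \in E} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr).
\tag{16}
\]

Proof: Assign weight $w_{ij} = \psi^U_{ij}$ to an edge $(i,j) \in E$. Using the argument of Lemma 6, we obtain that there exists a matching $M \subset E$ such that

\[
\sum_{(i,j) \in M} \psi^U_{ij} \ge \frac{1}{d^*+1} \sum_{(i,j) \in E} \psi^U_{ij}.
\]

Now, consider an assignment $x^M$ as follows: for each $(i,j) \in M$ set $(x^M_i, x^M_j) = \arg\max_{(x,x') \in \Sigma^2} \psi_{ij}(x,x')$; for the remaining $i \in V$, set $x^M_i$ to some value in $\Sigma$ arbitrarily. Note that for the above assignment to be possible, we have used the matching property of $M$. Therefore, we have

\[
\begin{aligned}
\mathcal{H}(x^M) &= \sum_{i \in V} \phi_i(x^M_i) + \sum_{(i,j) \in E \backslash M} \psi_{ij}(x^M_i, x^M_j) + \sum_{(i,j) \in M} \psi_{ij}(x^M_i, x^M_j) \\
&\overset{(a)}{\ge} \sum_{(i,j) \in M} \psi_{ij}(x^M_i, x^M_j) = \sum_{(i,j) \in M} \psi^U_{ij} \\
&\ge \frac{1}{d^*+1} \sum_{(i,j) \in E} \psi^U_{ij}.
\end{aligned}
\tag{17}
\]

Here (a) follows because $\psi_{ij}, \phi_i$ are non-negative valued functions. Since $\mathcal{H}(x^*) \ge \mathcal{H}(x^M)$ and $\psi^L_{ij} \ge 0$ for all $(i,j) \in E$, we obtain Lemma 8. ■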
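The matching used in Lemmas 6 and 8 comes from a proper edge coloring followed by a pigeonhole step. The sketch below illustrates only the pigeonhole step. As a loudly stated assumption, it swaps in a simple greedy edge coloring, which may use more than the $d^*+1$ colors guaranteed to exist by Vizing's theorem; with $c$ colors the heaviest class carries at least a $1/c$ fraction of the total weight. The function names are hypothetical.

```python
def greedy_edge_coloring(edges):
    """Proper edge coloring: each edge takes the smallest color that is
    unused at both endpoints. Greedy may need more than d* + 1 colors."""
    used = {}   # node -> set of colors already present at that node
    color = {}  # edge -> color
    for (i, j) in edges:
        used.setdefault(i, set())
        used.setdefault(j, set())
        c = 0
        while c in used[i] or c in used[j]:
            c += 1
        color[(i, j)] = c
        used[i].add(c)
        used[j].add(c)
    return color


def heaviest_color_class(edges, weight):
    """Return the color class (a matching) of largest total weight and the
    number of colors used; by pigeonhole its weight is >= total / #colors."""
    color = greedy_edge_coloring(edges)
    classes = {}
    for e, c in color.items():
        classes.setdefault(c, []).append(e)
    best = max(classes.values(), key=lambda cls: sum(weight[e] for e in cls))
    return best, len(classes)
```

Each color class is a matching because no two edges sharing an endpoint receive the same color, which is exactly the property the proofs need for the assignment on $M$ to be well defined.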

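Lemma 9: If $G$ has maximum vertex degree $d^*$ and DECOMP($G$) produces $\mathcal{B}$ that is an $(\epsilon, \Delta)$ edge-decomposition, then

\[
E\bigl[\mathcal{H}(x^*) - \mathcal{H}(\hat{x}^*)\bigr] \le \epsilon (d^* + 1) \mathcal{H}(x^*),
\]

where the expectation is w.r.t. the randomness in $\mathcal{B}$. Further, MODE takes time $O(n d^* |\Sigma|^{\Delta}) + T_{\mathrm{DECOMP}}$.

Proof: From Theorem 4, Lemma 8 and the definition of $(\epsilon, \Delta)$ edge-decomposition, we have the following.

\[
\begin{aligned}
E\bigl[\mathcal{H}(x^*) - \mathcal{H}(\hat{x}^*)\bigr] &\le E\Bigl[ \sum_{(i,j) \in \mathcal{B}} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr) \Bigr] \\
&= \sum_{(i,j) \in E} \Pr\bigl((i,j) \in \mathcal{B}\bigr) \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr) \\
&\le \epsilon \sum_{(i,j) \in E} \bigl(\psi^U_{ij} - \psi^L_{ij}\bigr) \\
&\le \epsilon (d^* + 1) \mathcal{H}(x^*).
\end{aligned}
\tag{18}
\]

The running time bound can be obtained using arguments similar to those in Lemma 7. ■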
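To make the MODE procedure and the guarantee of Theorem 4 concrete, here is a small brute-force sketch. It assumes $\mathcal{H}(x) = \sum_i \phi_i(x_i) + \sum_{(i,j)} \psi_{ij}(x_i,x_j)$ with values in $\{0,\ldots,k-1\}$, takes the removed edge set as an input rather than implementing a specific DECOMP, and solves each component by exhaustive search. The function names are hypothetical.

```python
import itertools


def connected_components(nodes, edges):
    """Connected components of the graph (nodes, edges)."""
    adj = {v: set() for v in nodes}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def score(x, nodes, edges, phi, psi):
    """H(x) restricted to the given nodes and edges."""
    return (sum(phi[i][x[i]] for i in nodes)
            + sum(psi[(i, j)][x[i]][x[j]] for (i, j) in edges))


def brute_force_map(nodes, edges, phi, psi, k):
    """Exact maximizer of H over all k^|nodes| assignments."""
    best, best_val = None, float("-inf")
    for values in itertools.product(range(k), repeat=len(nodes)):
        x = dict(zip(nodes, values))
        val = score(x, nodes, edges, phi, psi)
        if val > best_val:
            best, best_val = x, val
    return best


def mode(nodes, edges, phi, psi, k, removed):
    """Solve MAP exactly inside each component, then glue the pieces."""
    kept = [e for e in edges if e not in removed]
    x_hat = {}
    for comp in connected_components(nodes, kept):
        members = set(comp)
        inner = [e for e in kept if e[0] in members]
        x_hat.update(brute_force_map(comp, inner, phi, psi, k))
    return x_hat
```

Evaluating `score` on the glued assignment over the full edge set gives $\mathcal{H}(\hat{x}^*)$, which can be compared with the exact optimum from `brute_force_map` on the whole graph to observe the two-sided bound of Theorem 4 on small instances.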



C. Analysis of MODE: Low doubling dimension G

Here we interpret the results obtained in Theorem 4 and Lemma 9 for $G$ that has low doubling-dimension and uses decomposition scheme DB-DIM.

Theorem 5: Let MRF graph G of rt nodes with doubling 
dimension p be given. Consider any e S (0,1) such that 
/ci(p + logl/e) = o(loglogn), define Lp = £2^p^^. Then 
Mode using DB-DiM((y5, K{if, p)) produces bounds x* such 
that 



7^(x*)-H(x*) <e?^(x*). 

The algorithm takes 0{n2PCo{e, p)) time to obtain the es- 
timate, where Co{e,p) = Further, if p{p + 
logl/e) = o(loglogn) then the algorithm takes o{n^^^) 
amount of time for any (5 > 0. 

Proof: Theorem |4] Lemma |9] and Lemma |2] imply that 
the output produced by MODE algorithm is such that 

E \n{x*) - n^?)] < e2-P-^{d* + l)7^(x*) 

< £H(x*), (19) 

because d* + 1 < 2''+^ for a graph with doubling dimension 
p. The running time analysis of the algorithm follows exactly 
the same arguments as those in the proof of Theorem |2] 



D. Analysis of MODE: Minor-excluded G

We apply Theorem 4 and Lemma 9 for minor-excluded graphs when the DECOMP procedure is MINOR-E. We obtain the following precise result.

Theorem 6: Let MRF graph $G$ of $n$ nodes exclude $K_{r,r}$ as its minor. Let $d^*$ be the maximum vertex degree in $G$. Given $\epsilon > 0$, use the MODE algorithm with MINOR-E($G, r, \Delta$) where $\Delta = \lceil r(d^*+1)/\epsilon \rceil$. Then,

\[
E\bigl[\mathcal{H}(x^*) - \mathcal{H}(\hat{x}^*)\bigr] \le \epsilon \mathcal{H}(x^*).
\]

Further, the algorithm takes $O(n\,C(d^*, |\Sigma|, \epsilon))$ time, where $C(d^*, |\Sigma|, \epsilon) = d^* |\Sigma|^{d^{*O(\Delta)}}$. Therefore, if $\epsilon^{-1} d^* \log d^* = o(\log\log n)$, then the algorithm takes $o(n^{1+\delta})$ steps for arbitrary $\delta > 0$.

Proof: From Lemma 5 about the MINOR-E algorithm, we have that with the choice of $\Delta = \lceil r(d^*+1)/\epsilon \rceil$, the algorithm produces an $(r/\Delta, \Lambda)$ edge-decomposition with $r/\Delta \le \epsilon/(d^*+1)$ and $\Lambda = d^{*O(\Delta)}$. Therefore, from Lemma 9 it follows that

\[
E\bigl[\mathcal{H}(x^*) - \mathcal{H}(\hat{x}^*)\bigr] \le \epsilon \mathcal{H}(x^*).
\]

Now, by Lemma 9 the algorithm running time is $O(n d^* |\Sigma|^{\Lambda}) + T_{\mathrm{DECOMP}}$. As discussed earlier in Lemma 5, the algorithm MINOR-E takes $O(r|E|) = O(n r d^*)$ operations. That is, $T_{\mathrm{DECOMP}} = O(n r d^*)$. Now, $\Lambda = d^{*O(\Delta)}$ and $\Delta \le r(d^* + 1)/\epsilon + 1$. Therefore, the first term of the computation time bound is bounded above by

\[
O\Bigl( n d^* |\Sigma|^{d^{*O(r d^*/\epsilon)}} \Bigr).
\]

Now, we will establish that the above term is $o(n^{1+\delta})$ under the hypothesis $\epsilon^{-1} d^* \log d^* = o(\log\log n)$. The hypothesis implies that (since $r$ is a constant, not scaling with $n$):

\[
\Delta \log d^* = o(\log\log n).
\]

That is, for any finite $L$ (say, $L = 2$) we have that

\[
\Lambda = O(\log^{1/L} n).
\]

This in turn implies that, for finite $|\Sigma|$ we have

\[
|\Sigma|^{\Lambda} = o(n^{\delta/2}),
\]

for any $\delta > 0$. Further, $d^* = o(\log\log n) = O(n^{\delta/2})$. Therefore, it follows that

\[
n d^* |\Sigma|^{\Lambda} = o(n^{1+\delta}).
\]

This completes the proof of Theorem 6. ■
VI. Message-passing implementation through self-avoiding walk

The approximate inference algorithms LOG PARTITION and MODE presented above are local in the sense that, in order to make the computation, the centralization of the algorithm is limited only up to each connected component. This section provides a method for designing a message-passing implementation for computing these estimates using self-avoiding walk trees. This message-passing algorithm is explained for MAP computation and is restricted to binary MRFs. It is worth noting that any MAP estimation problem over a discrete pair-wise exponential family can be converted into a binary pair-wise MRF with the help of additional nodes. This is explained in Appendix B. Thus, in principle, this message-passing algorithm can work for any discrete valued Markov random field represented by a factor graph.

A. Equivalence: MRF and Self-Avoiding Walk Tree

The first result is about the equivalence of the max-marginal of a node, say $v$, in an MRF $G$ and the max-marginal of the root of the self-avoiding walk tree with respect to $v$. Dror Weitz [18] showed such an equivalence in the context of marginal distributions of the nodes. We establish the result for max-marginals. However, the proof is a direct adaptation of the proof of the result by Weitz [18].



[Figure 3: three panels titled $T_{SAW}(G, 1)$, $T_{COMP}(G, 1)$ and Graph $G$.]

Fig. 3. A graph $G$ of 4 nodes with one loop is given. On the left, we have the self-avoiding walk tree of $G$ for node 1, i.e. $T_{SAW}(G, 1)$, with green and red being special nodes. On the right, we have the computation tree $T_{COMP}(G, 1)$ for node 1's computation under the Belief Propagation (or Max-Product) algorithm. The grey nodes of $T_{COMP}(G, 1)$ correspond to the green and red nodes of $T_{SAW}(G, 1)$ on the left.



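Given a binary pair-wise MRF $G$ of $n$ nodes, our interest is in finding

\[
p^*_v(\gamma) = \max_{\sigma \in \{0,1\}^n : \sigma_v = \gamma} \Pr(\sigma), \quad\text{for } \gamma \in \{0,1\} \text{ for all } v \in V.
\]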



Definition 1 (Self-Avoiding Walk Tree): Consider graph $G = (V, E)$ of a pair-wise binary MRF. For $v \in V$, we define the self-avoiding walk tree $T_{SAW}(G,v)$ as follows. First, for each $u \in V$, give an ordering of its neighbors $N(u)$. This ordering can be arbitrary but remains fixed forever. Given this, $T_{SAW}(G,v)$ is constructed by the breadth-first search of nodes of $G$ starting from $v$ without backtracking. Then stop the breadth-first search along a direction when an already visited vertex is encountered (but include it in $T_{SAW}(G,v)$ as a leaf). Say one such leaf is $\hat{w}$ of $T_{SAW}(G,v)$ and let it be a copy of a node $w$ in $G$. We call such a leaf node of $T_{SAW}(G,v)$ Marked. A marked leaf node is assigned color Red or Green according to the following condition: the leaf $\hat{w}$ is marked since we encountered node $w$ of $G$ twice along our breadth-first search excursion. Let the (directed) path between these two encounters of $w$ in $G$ be given by $(w, v_1, \ldots, v_k, w)$. Naturally, $v_1, v_k \in N(w)$ in $G$. We mark the leaf node $\hat{w}$ as Green if, according to the ordering done by node $w$ in $G$ of its neighbors, $v_k$ is given a smaller number than that of $v_1$. Else, we mark it as Red. Let $V_v$ and $E_v$ denote the set of nodes and edges of tree $T_{SAW}(G,v)$. With a little abuse of notation, we will call the root of $T_{SAW}(G,v)$ $v$.
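Definition 1 can be turned into a short recursive construction. The sketch below is an illustration under stated assumptions: it enumerates the walks depth-first rather than breadth-first (the resulting tree is the same), represents the fixed neighbor ordering by the order of each adjacency list, and uses the hypothetical names `saw_tree` and `flatten`. The 4-node example graph with edges (1,2), (1,3), (2,3), (3,4) is inferred from the edge labels that survive in Figure 3 and should be read as an assumption.

```python
def saw_tree(adj, root):
    """Self-avoiding walk tree as nested dicts {'node', 'mark', 'children'}.

    adj maps each node to the ordered list of its neighbors; the position
    in that list plays the role of the fixed numbering of N(u)."""
    def build(path):
        u = path[-1]
        parent = path[-2] if len(path) > 1 else None
        children = []
        for x in adj[u]:
            if x == parent:
                continue  # no backtracking along the edge just used
            if x in path:
                # x closes the cycle (x, v1, ..., vk, x): add a marked leaf.
                idx = path.index(x)
                v1, vk = path[idx + 1], u
                rank = adj[x].index
                mark = "Green" if rank(vk) < rank(v1) else "Red"
                children.append({"node": x, "mark": mark, "children": []})
            else:
                children.append(build(path + [x]))
        return {"node": u, "mark": None, "children": children}

    return build([root])


def flatten(tree):
    """All tree nodes in pre-order."""
    out = [tree]
    for child in tree["children"]:
        out.extend(flatten(child))
    return out
```

With `adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}`, the tree rooted at node 1 has two marked leaves, both copies of node 1, one Green and one Red; which walk receives which color depends on the chosen ordering of node 1's neighbors.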

Given a $T_{SAW}(G,v)$ for a node $v \in V$ in $G$, an MRF is naturally induced on it as follows: all edges inherit the pair-wise compatibility functions (i.e. $\psi_{\cdot\cdot}(\cdot,\cdot)$) and all nodes inherit node-potentials (i.e. $\phi_{\cdot}(\cdot)$) from those of MRF $G$ in a natural manner. The only distinction is the modification of the node-potential of marked leaf nodes of $T_{SAW}(G,v)$ as follows. A marked leaf node, say $\hat{w}$ of $T_{SAW}(G,v)$, modifies its potentials as follows: if it is Green then it sets $\phi_{\hat{w}}(1) = \phi_w(1)$, $\phi_{\hat{w}}(0) = 0$, but if it is a Red leaf node then it sets $\phi_{\hat{w}}(0) = \phi_w(0)$, $\phi_{\hat{w}}(1) = 0$.

Example 1 (Self-avoiding walk tree): Consider the 4 node binary pair-wise MRF $G$ in Figure 3. Let node 1 give number $a$ to node 2 and number $b$ to node 3 so that $a > b$. Given this numbering, the bottom left of Figure 3 represents $T_{SAW}(G, 1)$. The Green leaf node essentially means that we set its value permanently to 1.

With the above description, $T_{SAW}(G,v)$ gives rise to a pair-wise binary MRF. Let $Q_{G,v}$ denote the probability distribution induced by this MRF on the boolean cube $\{0,1\}^{|V_v|}$. Our interest will be in the max-marginal for root $v$ or equivalently

\[
q^*_v(\gamma) = \max_{\sigma \in \{0,1\}^{|V_v|} : \sigma_v = \gamma} Q_{G,v}(\sigma), \quad\text{where } \gamma \in \{0,1\}.
\]

Here we present an equivalence between $p^*_v(\cdot)$ and $q^*_v(\cdot)$. This is a direct adaptation of the result by Weitz [18].

Theorem 7: Consider any binary pair-wise MRF $G = (V,E)$. For any $v \in V$, let $p^*_v(\cdot)$ be as defined above with respect to $\Pr_G$. Let $T_{SAW}(G,v)$ be the self-avoiding walk tree MRF and let $q^*_v(\cdot)$ be as defined above for the root node of $T_{SAW}(G,v)$ with respect to $Q_{G,v}$. Then,

\[
\frac{p^*_v(1)}{p^*_v(0)} = \frac{q^*_v(1)}{q^*_v(0)}.
\tag{20}
\]

Here we allow the ratio to be $0, \infty$.

Proof: The proof follows by induction. As a part of the 
proof, we will come across graphs with some fixed vertices, 
where a vertex u is said to be fixed to (resp. 1) if 0u(O) > 
, 0„(1) = (resp. 0„(1) > , 0„(O) = 0). The induction 
is on the number of unfixed vertices of G. We essentially 
prove the following, which implies the statement of Lemma: 
given any pair-wise MRF on a graph G (with possibly some 
fixed vertices), construct corresponding TsawIG, v) MRF for 
some node v. If the number of unfixed vertex of G is at most 
m, then the ( |20] | holds. Next, inductive proof. 

Initial condition. Trivially the desired statement holds for
any graph with exactly one unfixed vertex, by definition of
MRF, i.e. (1). The reason is that for such a graph, since
all but one node are fixed, the max-marginal of each
node is purely determined by its immediate neighbors due to the
Markovian nature of the MRF. The immediate neighborhood of $v$
in $T_{SAW}(G,v)$ and in $G$ is the same.

Hypothesis. Assume that the statement is true for any graph
with less than or equal to $m \in \mathbb{N}$ unfixed nodes.

Induction step. Without loss of generality, suppose that our
graph of interest, $G$, has $m + 1$ unfixed vertices. If $v$ is a
fixed vertex, then (20) holds trivially. Let $v \in V$ be an unfixed
vertex of $G$. Then we will show via the inductive hypothesis that
(20) holds.

Let $d$ be the degree of $v$; let $v_1, v_2, \dots, v_d$ be the neighbors
of $v$, where the order of neighbors is the same as that used
in the definition of $T_{SAW}(G,v)$. Let $T_\ell$ be the $\ell$th subtree of
$T_{SAW}(G,v)$ having $v_\ell$ as its root and $Y(\ell)$ be the binary pair-wise
MRF induced on $T_\ell$ by restriction of $T_{SAW}(G,v)$. Let
$q^\ell_{v_\ell}(\sigma)$ be the max-marginal of vertex $v_\ell$ taking value
$\sigma \in \Sigma = \{0,1\}$ with respect to $Y(\ell)$. Note that when $T_\ell$ consists of a
single vertex, then $q^\ell_{v_\ell}(\sigma) \propto \phi_{v_\ell}(\sigma)$. Let
$\lambda_v = \phi_v(1)/\phi_v(0)$. Then
from the definition of pair-wise MRF and the tree structure,

$$\frac{q_v^*(1)}{q_v^*(0)} = \lambda_v \prod_{\ell=1}^{d} \frac{\max_{\sigma \in \Sigma} \psi_{v_\ell,v}(\sigma,1)\, q^\ell_{v_\ell}(\sigma)}{\max_{\sigma \in \Sigma} \psi_{v_\ell,v}(\sigma,0)\, q^\ell_{v_\ell}(\sigma)}. \qquad (21)$$



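Now to calculate $p_v^*(1)/p_v^*(0)$ we define a new graph $G'$ and
the corresponding pair-wise MRF $X'$ as follows. Let $G'$
be the same as $G$ except that $v$ is replaced by $d$ vertices
$v'_1, v'_2, \dots, v'_d$; each $v'_\ell$ is connected only to $v_\ell$, $1 \le \ell \le d$.
The MRF $X'$ is defined the same as $X$ except that
$\phi_{v'_\ell}(1) = \phi_v(1)^{1/d}$, $\phi_{v'_\ell}(0) = \phi_v(0)^{1/d}$ and
$\psi_{v_\ell, v'_\ell} = \psi_{v_\ell, v}$ (so each copy carries the ratio
$(\phi_v(1)/\phi_v(0))^{1/d} = \lambda_v^{1/d}$). Then,

$$\frac{p_v^*(1)}{p_v^*(0)} = \frac{\max_{X' : X'_{v'_1} = 1, \dots, X'_{v'_d} = 1} \Pr_{G'}(X')}{\max_{X' : X'_{v'_1} = 0, \dots, X'_{v'_d} = 0} \Pr_{G'}(X')} = \prod_{\ell=1}^{d} \frac{\mu_\ell(1)}{\mu_\ell(0)}, \qquad (22)$$

where we define

$$\mu_\ell(\sigma) = \max_{X' : X'_{v'_\ell} = \sigma} \Pr_{G'}\big[X' \mid X'_{v'_1} = 0, \dots, X'_{v'_{\ell-1}} = 0, X'_{v'_{\ell+1}} = 1, \dots, X'_{v'_d} = 1\big].$$

The second equality in (22) follows by the standard trick of telescoping
multiplication and Lemma 10.

Now for $1 \le \ell \le d$, consider the MRF $X'(\ell)$ induced on
$G'(\ell) = G' - \{v'_\ell\}$ by fixing $\{v'_1, \dots, v'_d\} - \{v'_\ell\}$ as follows:
let $(\phi_{v'_1}(0) = 1, \phi_{v'_1}(1) = 0); \dots; (\phi_{v'_{\ell-1}}(0) = 1, \phi_{v'_{\ell-1}}(1) = 0);
(\phi_{v'_{\ell+1}}(0) = 0, \phi_{v'_{\ell+1}}(1) = 1); \dots; (\phi_{v'_d}(0) = 0, \phi_{v'_d}(1) = 1)$.
Then let $\nu_\ell(\sigma)$, $\sigma \in \Sigma$, denote the max-marginal of $v_\ell$ for
taking value $\sigma$ with respect to $X'(\ell)$. Given this, by definition
of the MRF $X'$ as well as $X'(\ell)$, and noting that $v'_\ell$ is a leaf (only
connected to $v_\ell$) with respect to graph $G'$, we have

$$\frac{\mu_\ell(1)}{\mu_\ell(0)} = \lambda_v^{1/d}\, \frac{\max_{\sigma \in \Sigma} \psi_{v_\ell,v}(\sigma,1)\, \nu_\ell(\sigma)}{\max_{\sigma \in \Sigma} \psi_{v_\ell,v}(\sigma,0)\, \nu_\ell(\sigma)}. \qquad (23)$$

From (21), (22) and (23) it is sufficient to show that

$$\frac{\nu_\ell(1)}{\nu_\ell(0)} = \frac{q^\ell_{v_\ell}(1)}{q^\ell_{v_\ell}(0)}. \qquad (24)$$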

Now, note that $T_\ell$ is the same as $T_{SAW}(G'(\ell), v_\ell)$ with respect
to $X'(\ell)$. Because for each $\ell = 1, \dots, d$, $G'(\ell)$ has one
less unfixed node than $G$, the desired result (24) follows by the
induction hypothesis. ■

Lemma 10: Consider a distribution on $X = (X_1, \dots, X_n)$
where the $X_i$ are binary variables. Let $p_s = \Pr[X = s]$, $s \in \Sigma^n$.
Let $p_{s|a_2,\dots,a_d} = \Pr[X = s \mid X_2 = a_2, \dots, X_d = a_d]$ for any
$d \ge 1$. Let $S(a_1, \dots, a_d) = \{s = (s_1, \dots, s_n) \in \Sigma^n : s_1 = a_1, \dots, s_d = a_d\}$
and write $\bar{a}_1 = 1 - a_1$. Then,

$$\frac{\max_{s \in S(a_1,a_2,\dots,a_d)} p_s}{\max_{s \in S(\bar{a}_1,a_2,\dots,a_d)} p_s} = \frac{\max_{s \in S(a_1,a_2,\dots,a_d)} p_{s|a_2,\dots,a_d}}{\max_{s \in S(\bar{a}_1,a_2,\dots,a_d)} p_{s|a_2,\dots,a_d}}.$$

Proof: Let $q = \Pr(X_2 = a_2, \dots, X_d = a_d)$.
Then, by definition of conditional probability, for
$s \in S(a_1,a_2,\dots,a_d) \cup S(\bar{a}_1,a_2,\dots,a_d)$, $p_s = p_{s|a_2,\dots,a_d}\, q$. From
this, the Lemma follows immediately. ■
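
Lemma 10 only says that the common factor $q$ cancels in the ratio of maxima. A small numerical check makes this concrete; the sketch below assumes a randomly generated distribution on $\{0,1\}^n$, and the helper names `lemma10_ratios` and `random_distribution` are hypothetical.

```python
import random
from itertools import product


def random_distribution(n, seed=0):
    """A strictly positive random distribution on {0,1}^n (illustrative only)."""
    rng = random.Random(seed)
    weights = {s: rng.random() + 0.01 for s in product((0, 1), repeat=n)}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def lemma10_ratios(p, a):
    """Return (ratio of joint maxima, ratio of conditional maxima) for a = (a_1, ..., a_d)."""
    a = tuple(a)
    d = len(a)
    a_bar = (1 - a[0],) + a[1:]
    # q = Pr(X_2 = a_2, ..., X_d = a_d)
    q = sum(ps for s, ps in p.items() if s[1:d] == a[1:])

    def max_joint(prefix):
        return max(ps for s, ps in p.items() if s[:d] == prefix)

    def max_conditional(prefix):
        # For s consistent with (a_2, ..., a_d), Pr[X = s | X_2 = a_2, ...] = p_s / q.
        return max(ps / q for s, ps in p.items() if s[:d] == prefix)

    return max_joint(a) / max_joint(a_bar), max_conditional(a) / max_conditional(a_bar)
```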

B. Size of Self-avoiding walk tree

We present a novel characterization of the size of the
self-avoiding walk tree in terms of the number of edges in it (which
is equal to the number of nodes minus 1). This characterization is
necessary to obtain a bound on the running time of computation over
the self-avoiding walk tree. This combinatorial result should be of
interest in its own right.

Lemma 11: Consider a connected graph $G = (V, E)$ with
$|V| = n$ nodes and $|E| = n - 1 + k$ edges, $k \ge 0$. Then for
any $v \in V$, $|T_{SAW}(G,v)| \le (n + k - 1)2^{k+1}$. Further, there
exists a graph with $n - 1 + k$ edges with $k < n/2$ so that for
any node $v \in V$, $|T_{SAW}(G,v)| \ge n 2^{k-1}$.

Proof: The proof is divided into two parts. We first
provide the proof of the lower bound. Consider a line graph of $n$
nodes (with $n - 1$ edges). Now add $k < n/2$ edges as follows.
Add an edge between $1$ and $n$. The remaining $k - 1$ edges are
added between node pairs: $(2,4), (4,6), \dots, (2(k-2), 2(k-1)),
(2(k-1), 2k)$. Consider any node, say $v$. It is easy to
see that there are at least $2^{k-1}$ different ways in which one
can start walking on the graph from node $v$ towards node $1$,
cross from $1$ to $n$ via edge $(1,n)$ and then come back to node
$v$. Each of these different loops, starting from $v$ and ending
at $v$, creates 2 distinct paths in the self-avoiding walk tree of
length at least $n/2$. Thus, the size of the self-avoiding walk tree of
each node is at least $n 2^{k-1}$. This completes the
proof of the lower bound.

Now, we prove the upper bound of $(n + k - 1)2^{k+1}$ on the size of the
self-avoiding walk tree for each node $v \in V$. Given that $G$ is
connected, we can divide the edge set as $E = E_T \cup E_k$ where
$E_k = \{e_1, \dots, e_k\}$ and $T = (V, E_T)$ forms a spanning tree
of $G$. Let $\mathcal{S}$ be the set of all subsets of $E_k = \{e_1, \dots, e_k\}$
(there are $2^k$ of them, including the empty set). Now fix a vertex
$v \in V$ and concentrate on $T_{SAW}(G,v)$. Consider any
$u \in V$ (can be $v$) and $S \in \mathcal{S}$. Next, we wish to count the number
of paths in $T_{SAW}(G,v)$ that end at (a copy of) $u$ (however,
$u$ need not be a leaf), contain all edges in $S$ but none from
$E_k \setminus S$. We claim the following.

Claim. There can be at most one path of $T_{SAW}(G,v)$ from
$v$ to (a copy of) $u$ containing all edges from $S$ but none
from $E_k \setminus S$.

Proof: To prove the above claim, suppose it is not true.
Then there are at least two distinct paths from $v$ to $u$ that
contain all edges in $S$ (but none from $E_k \setminus S$). Consider the
symmetric difference of these two paths (in terms of edges).
This symmetric difference must be a non-empty subset of $E_T$
and also contain a loop (as the two paths have the same starting
and ending points). But this is not possible as $T = (V, E_T)$
is a tree and it does not contain a loop. This contradicts our
assumption and proves the claim. ■

Given the above claim, for any node $u$, clearly the number
of distinct paths from node $v$ to (a copy of) $u$ in $T_{SAW}(G,v)$
is at most $2^k$. Now each edge has two end points. For each
appearance of an edge of $G$ in $T_{SAW}(G,v)$, a distinct path
from $v$ to one of its end points must appear in $T_{SAW}(G,v)$.
From the above claim, this can happen at most $2 \times 2^k = 2^{k+1}$ times.
There are $n + k - 1$ edges of $G$ in total. Thus, the net number
of edges that can appear in $T_{SAW}(G,v)$ is at most
$(n + k - 1)2^{k+1}$, thus completing the proof of Lemma 11. ■
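
The upper bound of Lemma 11 can be checked by enumerating the tree explicitly on small graphs. The sketch below assumes the usual construction in which every non-backtracking extension of a self-avoiding walk contributes one tree edge, and an extension that revisits a vertex becomes a marked leaf. The function names, including `line_with_shortcuts` for the graph used in the lower-bound argument, are hypothetical.

```python
def saw_tree_edges(adj, root):
    """Number of edges of the self-avoiding walk tree rooted at `root`."""
    def count(path):
        u = path[-1]
        prev = path[-2] if len(path) > 1 else None
        total = 0
        for w in adj[u]:
            if w == prev:
                continue          # no immediate backtracking
            total += 1            # one tree edge per extension
            if w not in path:     # a revisited vertex is a leaf: stop there
                total += count(path + [w])
        return total

    return count([root])


def lemma11_upper_bound(adj):
    """(n + k - 1) * 2^(k+1) for a connected graph with n nodes and n - 1 + k edges."""
    n = len(adj)
    num_edges = sum(len(nb) for nb in adj.values()) // 2
    k = num_edges - (n - 1)
    return (n + k - 1) * 2 ** (k + 1)


def line_with_shortcuts(n, k):
    """Line graph on nodes 1..n plus edge (1, n) and edges (2j, 2j + 2), j = 1..k-1."""
    edges = {(i, i + 1) for i in range(1, n)}
    edges.add((1, n))
    edges |= {(2 * j, 2 * j + 2) for j in range(1, k)}
    adj = {i: [] for i in range(1, n + 1)}
    for a, b in sorted(edges):
        adj[a].append(b)
        adj[b].append(a)
    return adj
```

For a tree ($k = 0$) the count is simply $n - 1$, and each extra edge can at most double the number of walks that reach a given vertex, which is the source of the $2^{k+1}$ factor.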

C. Algorithm: At a higher level

Now, we describe an algorithm to compute MAP approximately.
The algorithm is the same as Mode; however, the
computation restricted to each component is done through the
self-avoiding walk tree. Specifically, the algorithm does the following:
given $G$, decompose it into (small) components $S_1, \dots, S_K$
by removing (few) edges $B \subset E$, where $B$ is obtained
using Decomp (as before, use Minor-E for minor-excluded graphs
and Db-Dim for graphs with low doubling
dimension). Then, compute an approximate MAP assignment
by computing the exact MAP restricted to the components. This
exact computation for each component is performed through
a message passing mechanism using the equivalence stated in
Theorem 7: essentially, growing the self-avoiding walk tree is just
sending messages along a breadth-first search tree, and computation
over a self-avoiding walk tree is essentially the standard
max-product (message passing) algorithm. The precise schedule for
message-passing is described in the next sub-section. Here, we
describe the algorithm for any graph $G$ at a higher level.
MODE(G)

(1) Use Decomp(G) to obtain $B \subset E$ such that

(a) $G' = (V, E \setminus B)$ is made of connected components
$S_1, \dots, S_K$.

(2) For each connected component $S_j$, $1 \le j \le K$, do the
following:

(a) Compute the exact MAP $x^{*,j}$ for component $S_j$.

(b) Computation of $x^{*,j}$ is performed by growing the
self-avoiding walk tree at each node $i$ of $S_j$, restricted to the
graph induced by the nodes of $S_j$, using a message passing
mechanism; then computing the max-marginal on the
self-avoiding walk tree using a message passing mechanism
(i.e. the standard max-product algorithm on the
self-avoiding walk tree).

(3) Produce output $x^*$, which is obtained by assigning values
to nodes using $x^{*,j}$, $1 \le j \le K$. This is clearly a local
operation.
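
The three steps of MODE can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the removed edge set `removed` is supplied by the caller (standing in for Decomp), the objective is taken in the additive form $\sum_i \phi_i(x_i) + \sum_{(i,j)} \psi_{ij}(x_i,x_j)$, and the exact MAP on each component is found by plain enumeration instead of the self-avoiding walk tree message passing. All function names are hypothetical.

```python
from itertools import product


def connected_components(nodes, edges):
    """Connected components of (nodes, edges), each returned as a sorted list."""
    adj = {i: set() for i in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        stack, comp = [s], []
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps


def energy(x, nodes, edges, phi, psi):
    """Additive objective: node terms plus edge terms."""
    return (sum(phi[i][x[i]] for i in nodes)
            + sum(psi[e][x[e[0]]][x[e[1]]] for e in edges))


def exact_map(nodes, edges, phi, psi):
    """Exact MAP on a small component by enumeration (stand-in for step 2)."""
    best, best_val = None, float("-inf")
    for xs in product((0, 1), repeat=len(nodes)):
        x = dict(zip(nodes, xs))
        val = energy(x, nodes, edges, phi, psi)
        if val > best_val:
            best, best_val = x, val
    return best


def mode(nodes, edges, phi, psi, removed):
    """Steps (1)-(3): drop the edges in `removed`, solve each component, merge."""
    kept = [e for e in edges if e not in removed]
    x_hat = {}
    for comp in connected_components(nodes, kept):
        members = set(comp)
        comp_edges = [e for e in kept if e[0] in members and e[1] in members]
        x_hat.update(exact_map(comp, comp_edges, phi, psi))
    return x_hat
```

With non-negative edge terms, the value of the merged assignment falls short of the optimum by at most the total spread of the removed edge terms, which is why the decomposition is chosen to remove few edges.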



D. Algorithm: Message-passing schedule

The following is a pseudo-code of a distributed message
passing algorithm Msg-Pass-Mode which computes $x^{*,j}$
for each component $S_j$. Msg-Pass-Mode finds the exact
MAP, by Theorem 7. This section is of interest primarily
because it provides the detailed distributed message-passing
implementation for computing MAP. A reader not
interested in such a detailed implementation may skip this
section.

To describe the pseudo-code, we need some notation. For each
node $v \in V$, let $N(v)$ denote the set of all its neighbors,
i.e. $N(v) = \{u \in V : (u,v) \in E\}$. Node $v$ assigns an
arbitrary fixed order to all nodes in $N(v)$. For example, if
$v$ has neighbors $u$, $w$ and $z$ then it can number $u$ as the first
neighbor, $w$ as the second neighbor and $z$ as the third neighbor. The
ordering chosen by each node is independent of the choices of
all other nodes. The algorithm operates in two phases. In
the first phase, the algorithm explores the local topology for each
node by sending "path sequences". By "path sequence" we
mean a finite sequence of vertices $(v_1, v_2, \dots, v_k)$, where
$(v_\ell, v_{\ell+1}) \in E$ for $1 \le \ell \le k - 1$. In the second phase, the
algorithm uses the path sequences to recursively calculate
"computation sequences", which in turn lead to the calculation
of $q^*(\cdot)$ at nodes. A "computation sequence" is of the form
$(v_1, v_2, \dots, v_k, m_{v_k}(0), m_{v_k}(1))$, where $m_{v_k}(\cdot)$ are certain
real numbers (which have the interpretation of messages). As we
shall see, the structure of the recursive calculation to obtain a
"computation sequence" is the same as that of the max-product
algorithm. Thus, there is a very strong connection between max-product (MP)
and Msg-Pass-Mode. For ease of exposition, the algorithm
is described to compute the ratio $q_v^*(1)/q_v^*(0)$ for all $v \in V$.

Msg-Pass-Mode(G)

(0) Initially, each vertex $v$ sends a path sequence $(v)$ to each
of its neighbors.

(1) When node $u$ receives a path sequence $(v_1, v_2, \dots, v_k)$
from its neighbor $v$ (note that, by construction given
later, $v_k = v$), it does the following:

o If $u$ is a leaf (i.e. $u$ is connected only
to $v$), $u$ sends back a computation sequence
$(v_1, v_2, \dots, v_k, u, m_u(0), m_u(1))$ to $v$, where

$$m_u(\sigma_u) = 1. \qquad (25)$$

o If $u$ is not a leaf, check whether $u$ appears among
$v_1, \dots, v_k$.

* If NO, $u$ sends a path sequence $(v_1, \dots, v_k, u)$ to
each of $u$'s neighbors but $v$.

* If YES, then let $v_i = u$, $1 \le i < k$.

- If, with respect to the ordering given by
node $u$ to its neighbors, the rank (order)
of node $v_{i+1}$ is larger than that of $v$, then $u$
sends back (to $v$) a computation sequence
$(v_1, v_2, \dots, v_k, u, m_u(0), m_u(1))$, where
$m_u(1) = 1$ and $m_u(0) = 0$.

- Otherwise (i.e. the rank of node $v_{i+1}$ is smaller
than that of $v$), $u$ sends back (to $v$) a computation
sequence $(v_1, v_2, \dots, v_k, u, m_u(0), m_u(1))$,
where $m_u(0) = 1$ and $m_u(1) = 0$.

(2) Once a node $u$ receives a computation sequence
$(v_1, \dots, v_k, m_{v_k}(0), m_{v_k}(1))$ from its neighbor $v$ (note
that, by construction, $v_k = v$ and $v_{k-1} = u$), it stores
this computation sequence in its memory and does the
following:

o If $k > 2$, check whether $u$ has stored
computation sequences of the form
$(v_1, \dots, v_{k-1}, w, m_w(0), m_w(1))$ for all
$w \in N(u) - \{v_{k-2}\}$. If so, $u$ sends a computation
sequence $(v_1, \dots, v_{k-1} (= u), m_u(0), m_u(1))$ to
$v_{k-2}$, where

$$m_u(\sigma) = \max_{\sigma' \in \Sigma} \psi_{u,v_{k-2}}(\sigma', \sigma)\, \phi_u(\sigma') \prod_{w \in N(u) - \{v_{k-2}\}} m_w(\sigma').$$

It then deletes the computation sequences
$(v_1, \dots, v_{k-1}, w, m_w(0), m_w(1))$ for all
$w \in N(u) - \{v_{k-2}\}$ from its memory.

o If $k = 2$, then check whether for all
$w \in N(u)$, $u$ has stored computation sequences
$(v_1, w, m_w(0), m_w(1))$. If so, compute the (estimate
of) max-belief of $u$ as

$$q_u^*(\sigma) \propto \phi_u(\sigma) \prod_{w \in N(u)} m_w(\sigma), \quad \text{and} \quad \sum_{\sigma \in \Sigma} q_u^*(\sigma) = 1.$$

(3) When all nodes have computed their max-beliefs, declare
$q_v^*(1)/q_v^*(0)$ as an estimate of $p_v^*(1)/p_v^*(0)$ for all $v \in V$.

VII. Experiments

Our algorithm provides a provably good approximation for
any MRF whose graph has low doubling dimension or excludes a
minor. Planar graphs are a special case of such graphs. The
popular model of the grid graph, which is both planar and has low
doubling dimension, will be used in the experimental section.
We will, however, use the decomposition algorithm Minor-E
for obtaining our results. Now we present the detailed setup and
experimental results.

A. Setup 1

Consider¹ a binary (i.e. $\Sigma = \{0,1\}$) MRF on an $n \times n$ lattice
$G = (V,E)$:

$$\Pr(x) \propto \exp\Big( \sum_{i \in V} \theta_i x_i + \sum_{(i,j) \in E} \theta_{ij} x_i x_j \Big), \quad \text{for } x \in \{0,1\}^{n^2}.$$

Figure 4 shows a lattice or grid graph with $n = 4$ (on the
left side). There are two scenarios for choosing parameters
(with notation $U[a,b]$ being the uniform distribution over the interval
$[a,b]$):

(1) Varying interaction. $\theta_i$ is chosen independently from the
distribution $U[-0.05, 0.05]$ and $\theta_{ij}$ is chosen independently from
$U[-\alpha, \alpha]$ with $\alpha \in \{0.2, 0.4, \dots, 2\}$.

(2) Varying field. $\theta_{ij}$ is chosen independently from the distribution
$U[-0.5, 0.5]$ and $\theta_i$ is chosen independently from $U[-\alpha, \alpha]$
with $\alpha \in \{0.2, 0.4, \dots, 2\}$.

Fig. 4. Example of grid graph (left) and cris-cross graph (right) with n = 4.

¹ Though this setup has $\phi_i$, $\psi_{ij}$ taking negative values, they are equivalent
to the setup considered in the paper as the function values are lower bounded
and hence an affine shift will make them non-negative without changing the
distribution.
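
The two parameter scenarios of Setup 1 can be sketched as follows. This is an illustrative reconstruction, not the code used for the reported experiments: the function names are hypothetical, and the brute-force log-partition routine is only feasible for very small $n$ (it enumerates all $2^{n^2}$ configurations).

```python
import math
import random
from itertools import product


def grid_edges(n):
    """Edges of the n x n lattice, nodes indexed by (i, j)."""
    edges = []
    for i in range(n):
        for j in range(n):
            if i + 1 < n:
                edges.append(((i, j), (i + 1, j)))
            if j + 1 < n:
                edges.append(((i, j), (i, j + 1)))
    return edges


def sample_parameters(n, scenario, alpha, rng):
    """Draw theta_i and theta_ij for the 'interaction' or 'field' scenario."""
    nodes = [(i, j) for i in range(n) for j in range(n)]
    if scenario == "interaction":
        node_range, edge_range = 0.05, alpha
    elif scenario == "field":
        node_range, edge_range = alpha, 0.5
    else:
        raise ValueError("scenario must be 'interaction' or 'field'")
    theta_node = {v: rng.uniform(-node_range, node_range) for v in nodes}
    theta_edge = {e: rng.uniform(-edge_range, edge_range) for e in grid_edges(n)}
    return theta_node, theta_edge


def log_partition_bruteforce(theta_node, theta_edge):
    """log Z by enumeration, using a log-sum-exp for numerical stability."""
    nodes = sorted(theta_node)
    exponents = []
    for xs in product((0, 1), repeat=len(nodes)):
        x = dict(zip(nodes, xs))
        exponents.append(
            sum(t * x[v] for v, t in theta_node.items())
            + sum(t * x[a] * x[b] for (a, b), t in theta_edge.items())
        )
    top = max(exponents)
    return top + math.log(sum(math.exp(e - top) for e in exponents))
```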




Fig. 5. Comparison of TRW, PDC and our algorithm for grid graph
with n = 7 with respect to error in log Z: (A) varying interaction
strength, (B) varying field strength. Our algorithm outperforms
TRW and is competitive with respect to PDC.

The grid graph is planar. Hence, we run our algorithms Log
Partition and Mode, with decomposition scheme
Minor-E($G$, 3, $\Delta$), $\Delta \in \{3,4,5\}$. We consider two measures to
evaluate performance: error in $\log Z$, defined as
$\frac{1}{n^2}|\log Z^{\text{alg}} - \log Z|$; and error in $E(x^*)$, defined as
$\frac{1}{n^2}|E(x^{\text{alg}}) - E(x^*)|$.

We compare our algorithm for error in $\log Z$ with
two recently very successful algorithms: the tree re-weighted
algorithm (TRW) and the planar decomposition algorithm (PDC).
The comparison is plotted in Figure 5, where $n = 7$ and results
are averages over 40 trials. Figure 5(A) plots error with
respect to varying interaction while Figure 5(B) plots error with
respect to varying field strength. Our algorithm essentially
outperforms TRW for these values of $\Delta$ and performs very
competitively with respect to PDC.

The key feature of our algorithm is scalability. Specifically, the
running time of our algorithm with a given parameter value
$\Delta$ scales linearly in $n$, while keeping the relative error bound
exactly the same. To explain this important feature, we plot the
theoretically evaluated bound on error in $\log Z$ in Figure 6 with
tags (A), (B) and (C). Note that the error bound plot is the same
for $n = 100$ (A) and $n = 1000$ (B). Clearly, the actual error is
likely to be smaller than these theoretically plotted bounds. We
note that these bounds only depend on the interaction strengths
and not on the values of the field strengths (C).

Fig. 6. The theoretically computable error bounds for log Z under
our algorithm for grid with n = 100 and n = 1000 under varying
interaction and varying field model. This clearly shows scalability of
our algorithm.

Results similar to those of Log Partition are expected from
Mode. We plot the theoretically evaluated bounds on the error
in MAP in Figure 7 with tags (A), (B) and (C). Again, the
bound on MAP relative error for a given parameter $\Delta$ remains
the same for all values of $n$, as shown in (A) for $n = 100$ and
(B) for $n = 1000$. There is no change in the error bound with
respect to the field strength (C).

B. Setup 2

Everything is exactly the same as in the above setup, with the
only difference that the grid graph is replaced by the cris-cross graph,
which is obtained by adding four extra neighboring edges per
node (with the exception of boundary nodes). Figure 4 shows the cris-cross
graph with $n = 4$ (on the right side). We again run the same
algorithm as in the above setup on this graph. For the cris-cross graph,
which is a graph with low doubling dimension, we obtained its
graph decomposition from the decomposition of its grid sub-graph.
Therefore, the running time of our algorithm remains
the same (in order) as that for the grid graph and the error bound
becomes only 3 times weaker than that for the grid graph. We
compute these theoretical error bounds for $\log Z$ and MAP,
which are plotted in Figures 8 and 9. These figures are similar
to Figures 6 and 7 for the grid graph.

Fig. 7. The theoretically computable error bounds for MAP under
our algorithm for grid with n = 100 and n = 1000 under varying
interaction and varying field model.

VIII. Unexpected implication: existence of limit

This section describes an important and somewhat unexpected
implication of our results, specifically Lemmas 7 and
9. In the context of a regular MRF, such as an MRF on $\mathbb{Z}_n^d$ (of
$n^d$ nodes) with the same node and edge potential functions for all
nodes and edges, we will show that the (non-trivial) limit of
$\frac{1}{n^d} \log Z$ exists as $n \to \infty$. It is worth noting that showing existence
of such limits is not straightforward in general and hence our
method should be of interest as such an analytic tool. We
believe that the result stated below is well-known; however, its
proof method is likely to allow for establishing such existence
for a more general class of problems. As an example, the
theorem will hold even when node and edge potentials are not
the same but are chosen from a class of such potentials as per
some distribution in an i.i.d. fashion. Now, we state the result.

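Theorem 8: Consider a regular MRF of $n^d$ nodes on the
$d$-dimensional grid $\mathbb{Z}_n^d = (V_n, E_n)$: let $\psi_{ij} = \psi$, $\phi_i = \phi$ for all
$i \in V_n$, $(i,j) \in E_n$ with $\psi : \Sigma^2 \to \mathbb{R}_+$, $\phi : \Sigma \to \mathbb{R}_+$. Let $Z_n$
be the partition function of this MRF. Then, the following limit
exists:

$$\lim_{n \to \infty} \frac{1}{n^d} \log Z_n = \Lambda(d, \phi, \psi) \in (0, \infty).$$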

A. Proof of Theorem 8

The proof of Theorem 8 is stated for the case $d = 2$ and $\Sigma = \{0,1\}$.
The cases $d \ge 3$ and $\Sigma$ with $|\Sigma| > 2$ can be proved using
exactly the same argument. The proof will use the following
Lemmas.

Fig. 8. The theoretically computable error bounds for log Z under our
algorithm for cris-cross with n = 100 and n = 1000 under varying
interaction and varying field model. This clearly shows scalability of
our algorithm and robustness to graph structure.

Fig. 9. The theoretically computable error bounds for MAP under our
algorithm for cris-cross with n = 100 and n = 1000 under varying
interaction and varying field model.



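Lemma 12: Let $d = 2$ and $\phi^* = \max_{\sigma \in \{0,1\}} \phi(\sigma)$,
$\psi^* = \max_{(\sigma,\sigma') \in \{0,1\}^2} \psi(\sigma,\sigma')$. Then,

$$\log 2 \le \frac{1}{n^2} \log Z_n \le \alpha,$$

where $\alpha = \log 2 + \phi^* + 2\psi^*$.

Lemma 13: Define $a_n = \frac{1}{n^2} \log Z_n$. Now, given $k > 0$,
there exists $n(k)$ large enough such that for any $m, n \ge n(k)$,

$$|a_m - a_n| \le O\Big(\frac{1}{k}\Big) + O\Big(\frac{k}{\min\{m,n\}}\Big).$$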

Proof: (Theorem 8) We state the proof of Theorem 8 before
proving the above stated Lemmas. First note that, by Lemma
12, the elements of the sequence $a_n = \frac{1}{n^2} \log Z_n$ take values in
$[\log 2, \alpha]$. Now, suppose the claim of the theorem is false. That is, the
sequence $a_n$ does not converge as $n \to \infty$. Since a sequence that
fails to converge is not Cauchy, there
exists $\delta > 0$ such that for any choice of $n_0$, there are $m, n \ge n_0$
such that

$$|a_m - a_n| > \delta.$$

By Lemma 13, we can select $k$ large enough and then $n_0 \ge n(k)$
large enough such that for any $m, n \ge n_0$,

$$|a_m - a_n| < \delta.$$

But this is a contradiction to our assumption that $a_n$ does
not converge to a limit. That is, we have established that
$a_n$ converges to a non-trivial limit in $[\log 2, \alpha]$ as desired. This
completes the proof of Theorem 8. ■



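B. Proofs of Lemmas

Proof: (Lemma 12) Consider the following.

$$\begin{aligned}
2^{n^2} &= \sum_{x \in \{0,1\}^{n^2}} \prod_{i \in V_n} 1 \prod_{(i,j) \in E_n} 1 \\
&\overset{(a)}{\le} \sum_{x \in \{0,1\}^{n^2}} \prod_{i \in V_n} \exp(\phi(x_i)) \prod_{(i,j) \in E_n} \exp(\psi(x_i,x_j)) \\
&= Z_n \\
&\overset{(b)}{\le} \sum_{x \in \{0,1\}^{n^2}} \prod_{i \in V_n} \exp(\phi^*) \prod_{(i,j) \in E_n} \exp(\psi^*).
\end{aligned} \qquad (26)$$

Here, (a) follows from the fact that $\phi$, $\psi$ are non-negative
valued functions and (b) follows from the definitions of $\phi^*$, $\psi^*$.
Now, taking logarithms on both sides, and using $|V_n| = n^2$ and
$|E_n| \le 2n^2$, implies Lemma 12. ■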

Proof: (Lemma 13) Given $k > 0$, consider $n$ large enough
(to be decided later). Consider $\mathbb{Z}_n^2 = (V_n, E_n)$ and let it be
laid out on the X-Y plane so that its nodes in $V_n$ occupy the
integral locations $(i,j)$: $0 \le i \le n - 1$, $0 \le j \le n - 1$.

Now, we describe a scheme to obtain an $(O(1/k), O(k^2))$
edge-decomposition of $\mathbb{Z}_n^2$. For this, choose $\ell_1, \ell_2 \in \{0, \dots, k-1\}$
independently and uniformly at random. Select edges to form
$B$ to obtain the edge-decomposition as follows: select vertical
edges whose bottom vertex has Y coordinate $\ell_2 + jk$, $j \ge 0$,
and select horizontal edges whose left vertex has X coordinate
$\ell_1 + jk$, $j \ge 0$. That is,

$$B = \{(u,v) \in E_n : u = (i,j),\ v = (i+1,j),\ i \bmod k = \ell_1\} \cup \{(u,v) \in E_n : u = (i,j),\ v = (i,j+1),\ j \bmod k = \ell_2\}.$$

It is easy to check that this is an $(O(1/k), O(k^2))$
edge-decomposition due to the uniform selection of $\ell_1, \ell_2$ from
$\{0, \dots, k-1\}$. Therefore, by Lemma 1, we can obtain
estimates that are $(1 \pm O(1/k)) \log Z_n$ using our algorithm.
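
The edge set $B$ above is easy to construct explicitly. The following sketch builds it for given offsets $\ell_1, \ell_2$ and reports the sizes of the resulting connected components, which should never exceed $k^2$. The function names are hypothetical and the union-find routine is just one convenient way to extract components.

```python
def grid_decomposition(n, k, l1, l2):
    """Split the edges of the n x n grid into (removed, kept) for offsets l1, l2."""
    removed, kept = [], []
    for i in range(n):
        for j in range(n):
            if i + 1 < n:  # horizontal edge with left vertex (i, j)
                (removed if i % k == l1 else kept).append(((i, j), (i + 1, j)))
            if j + 1 < n:  # vertical edge with bottom vertex (i, j)
                (removed if j % k == l2 else kept).append(((i, j), (i, j + 1)))
    return removed, kept


def component_sizes(n, kept):
    """Sorted sizes of the connected components left after removing B."""
    parent = {(i, j): (i, j) for i in range(n) for j in range(n)}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in kept:
        parent[find(a)] = find(b)
    sizes = {}
    for u in parent:
        r = find(u)
        sizes[r] = sizes.get(r, 0) + 1
    return sorted(sizes.values())
```

For example, with $n = 6$, $k = 3$ and $\ell_1 = \ell_2 = 2$, the scheme removes 12 of the 60 grid edges and leaves four $3 \times 3$ blocks.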

Let $m = \lfloor n/k \rfloor$. Under the decomposition $B$ as described
above, there are at least $(m - 1)^2$ connected components
that are MRFs on $\mathbb{Z}_k^2$. Also, all the connected components
can be covered by at most $(m + 1)^2$ identical MRFs on $\mathbb{Z}_k^2$.
Using arguments similar to those employed in the calculations of
Theorem 1 (using non-negativity of $\phi$, $\psi$), it can be shown that
the estimate produced by our algorithm is lower bounded as

$$(1 - O(1/k))(m - 1)^2 \log Z_k = n^2\, \frac{\log Z_k}{k^2} \times (1 - O(1/k) - O(k/n)),$$

and is upper bounded as

$$(1 + O(1/k))(m + 1)^2 \log Z_k = n^2\, \frac{\log Z_k}{k^2} \times (1 + O(k/n) + O(1/k)).$$

Therefore, from the above discussion we obtain that

$$\frac{1}{n^2} \log Z_n = \frac{1}{k^2} \log Z_k\, (1 \pm O(k/n) \pm O(1/k)).$$

Therefore, recalling the notation $a_n$, we have that

$$|a_m - a_n| = a_k\, O\Big(\frac{k}{\min\{m,n\}}\Big) + a_k\, O(1/k).$$

Since, by Lemma 12, $a_k$ is bounded uniformly in $k$, we obtain the
desired result of Lemma 13. ■

IX. Conclusion 

In this paper, we presented a simple, novel local approximation
algorithm for computing the log-partition function and the MAP
estimate for an arbitrary exponential family distribution represented
by a pair-wise MRF. We showed that these algorithms provide
bounds for arbitrary graphs with quantifiable approximation
guarantees. Further, for graphs with low doubling dimension and
minor-excluded graphs, they can provide arbitrary accuracy within
linear time. The main takeaway for a practitioner is the
following: there is a simple and intuitive local algorithm
that provides provable bounds with computable approximation
error for any graph, and hence it can be used as a good heuristic
that also produces a certificate of its approximation guarantee.

We proposed a message-passing implementation based on
self-avoiding walk trees, which should provide such implementations
for other problems as well. This method, through a
transformation from a non-binary exponential family to a binary
MRF, extends to any finite valued factor graph. However, this
can result in a somewhat redundant construction. Understanding
the design of direct constructions for non-binary pair-wise MRFs
is an important open problem.

We derived an unusual implication of our algorithmic results:
the existence of asymptotic limits of free energy for
a class of regular MRFs. Our result suggests a way to explicitly
evaluate these limits up to an arbitrary accuracy. This should
be of general interest as a method for establishing asymptotic
limits as well as computing these limits.

Finally, we remark that our methods are explained for the
exponential family only. However, they easily extend to certain
hard-core models such as independent set or matching where
there is a non-constraining assignment to node values.

Appendix A
Proof of Lemma 1

The proof is by induction on $r \in \mathbb{N}$. Throughout, $\rho$ denotes the
doubling dimension. For the base case, consider
$r = 0$. Now, $\mathbf{B}(x, 2^0 = 1)$ is the set of all points
which are at distance $< 1$ from $x$ by definition. Since the metric takes
integer values, this is the set of all points
that are at distance $0$ from $x$. By definition of a metric, we have that
$x$ is the only such point. That is, $\mathbf{B}(x, 1) = \{x\}$. Hence,
$|\mathbf{B}(x, 1)| = 1 \le 2^{0 \cdot \rho}$ for all $x \in X$.

Now suppose the claim of the Lemma is true for all $r \le k$ and
all $x \in X$. Consider $r = k + 1$ and any $x \in X$. By definition
of doubling dimension, there exist $\ell \le 2^{\rho}$ balls of radius
$2^k$, say $\mathbf{B}(y_j, 2^k)$ with $y_j \in X$ for $1 \le j \le \ell$, such that

$$\mathbf{B}(x, 2^{k+1}) \subset \bigcup_{j=1}^{\ell} \mathbf{B}(y_j, 2^k).$$

Therefore,

$$|\mathbf{B}(x, 2^{k+1})| \le \sum_{j=1}^{\ell} |\mathbf{B}(y_j, 2^k)|.$$

By the inductive hypothesis, for $1 \le j \le \ell$,

$$|\mathbf{B}(y_j, 2^k)| \le 2^{k\rho}.$$

Since we have $\ell \le 2^{\rho}$, we obtain

$$|\mathbf{B}(x, 2^{k+1})| \le \ell\, 2^{k\rho} \le 2^{(k+1)\rho}.$$

This completes the proof of the inductive step and that of
Lemma 1.

Appendix B 

Transformation: MAP in factor graph to binary pair-wise MRF

In this section we show that any MAP estimation problem
is equivalent to estimating MAP in a specific binary pair-wise
problem on a suitably constructed graph with node potentials.
This construction is from work by Sanghavi, Shah and Willsky
[23]. It is related to the "overcomplete basis"
representation [2]. Consider the following canonical MAP
estimation problem: suppose we are given a distribution $q(y)$
over vectors $y = (y_1, \dots, y_n)$ of variables $y_m$, each of which
can take a finite number of values. Suppose also that $q$ factors into a
product of strictly positive functions, which we find convenient
to denote in exponential form:

$$q(y) = \frac{1}{Z} \prod_{\alpha \in A} \exp(\phi_\alpha(y_\alpha)) = \frac{1}{Z} \exp\Big( \sum_{\alpha \in A} \phi_\alpha(y_\alpha) \Big).$$

Here $\alpha$ specifies the domain of the function $\phi_\alpha$, and $y_\alpha$ is
the vector of those variables that are in the domain of $\phi_\alpha$.
The $\alpha$'s also serve as an index for the functions. $A$ is the
set of functions. The MAP estimation problem is to find a
maximizing assignment $y^* \in \arg\max_y q(y)$.

We now build an auxiliary graph $G$, and assign weights
to its nodes, such that the MAP estimation problem above is
equivalent to finding the maximum weight independent set (MWIS)
of $G$. There is one node in
$G$ for each pair $(\alpha, y_\alpha)$, where $y_\alpha$ is an assignment (i.e. a set
of values for the variables) of domain $\alpha$. We will denote this
node of $G$ by $\delta(\alpha, y_\alpha)$.

There is an edge in $G$ between any two nodes $\delta(\alpha_1, y^1_{\alpha_1})$
and $\delta(\alpha_2, y^2_{\alpha_2})$ if and only if there exists a variable index $m$
such that

1) $m$ is in both domains, i.e. $m \in \alpha_1$ and $m \in \alpha_2$, and

2) the corresponding variable assignments are different, i.e.
$y^1_m \ne y^2_m$.

In other words, we put an edge between all pairs of nodes that
correspond to inconsistent assignments. Given this graph $G$,
we now assign weights to the nodes. Let $c > 0$ be any number
such that $c + \phi_\alpha(y_\alpha) > 0$ for all $\alpha$ and $y_\alpha$. The existence of
such a $c$ follows from the fact that the set of assignments and
domains is finite. Assign to each node $\delta(\alpha, y_\alpha)$ a weight of
$c + \phi_\alpha(y_\alpha)$. We first consider an example of this construction;
later, we state the precise equivalence.



Fig. 10. Example of transforming MAP for factor graph to MAP in binary
pair-wise MRF.

Example 2: Let $y_1$ and $y_2$ be binary variables with joint
distribution

$$q(y_1, y_2) = \frac{1}{Z} \exp(\theta_1 y_1 + \theta_2 y_2 + \theta_{12} y_1 y_2)$$

where the $\theta$ are any real numbers. The corresponding $G$ is
shown in Figure 10. Let $c$ be any number such that $c + \theta_1$,
$c + \theta_2$ and $c + \theta_{12}$ are all greater than 0. The weights on the
nodes in $G$ are: $\theta_1 + c$ on node "1" on the left, $\theta_2 + c$ for node
"1" on the right, $\theta_{12} + c$ for the node "11", and $c$ for all the
other nodes.

Lemma 14: Suppose $q$ and $G$ are as above. (a) If $y^*$
is a MAP estimate of $q$, let $\delta^* = \{\delta(\alpha, y^*_\alpha) \mid \alpha \in A\}$
be the set of nodes in $G$ that correspond to each domain
being consistent with $y^*$. Then, $\delta^*$ is an MWIS of $G$. (b)
Conversely, suppose $\delta^*$ is an MWIS of $G$. Then, for every
domain $\alpha$, there is exactly one node $\delta(\alpha, y^*_\alpha)$ included in $\delta^*$.
Further, the corresponding domain assignments $\{y^*_\alpha \mid \alpha \in A\}$
are consistent, and the resulting overall vector $y^*$ is a MAP
estimate of $q$.

Proof: A maximal independent set is one in which every
node is either in the set, or is adjacent to another node that
is in the set. Since weights are positive, any MWIS has to be
maximal. For $G$ and $q$ as constructed, it is clear that

1) If $y$ is an assignment of variables, consider the corresponding
set of nodes $\{\delta(\alpha, y_\alpha) \mid \alpha \in A\}$. Each domain
$\alpha$ has exactly one node in this set. Also, this set is an
independent set in $G$, because the partial assignments
$y_\alpha$ for all the nodes are consistent with $y$, and hence
with each other. This means that there will not be an
edge in $G$ between any two nodes in the set.
2) Conversely, if $\Delta$ is a maximal independent set in $G$,
then the partial assignments corresponding to
the nodes in $\Delta$ are all consistent with each other, and
with a global assignment $y$.

There is thus a one-to-one correspondence between maximal
independent sets in $G$ and assignments $y$. The lemma follows
from this observation. ■
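
The construction can be traced on Example 2. The sketch below builds the auxiliary graph for the three domains $\{1\}$, $\{2\}$ and $\{1,2\}$, finds a maximum weight independent set by enumeration, and decodes it back into an assignment. The parameter values $\theta_1 = 1$, $\theta_2 = -2$, $\theta_{12} = 0.5$, the constant $c = 3$ and all function names are assumptions made for illustration; brute-force enumeration is only meant for graphs this small.

```python
from itertools import combinations, product


def build_conflict_graph(factors, c):
    """Nodes (domain, partial assignment), weights c + phi, edges between conflicts."""
    nodes, weights = [], {}
    for domain, phi in factors.items():
        for y_alpha in product((0, 1), repeat=len(domain)):
            node = (domain, y_alpha)
            nodes.append(node)
            weights[node] = c + phi(y_alpha)

    def inconsistent(a, b):
        ya, yb = dict(zip(*a)), dict(zip(*b))
        return any(ya[m] != yb[m] for m in ya.keys() & yb.keys())

    edges = {frozenset((a, b)) for a, b in combinations(nodes, 2) if inconsistent(a, b)}
    return nodes, weights, edges


def brute_force_mwis(nodes, weights, edges):
    """Maximum weight independent set by enumerating all subsets of nodes."""
    best, best_w = (), float("-inf")
    for r in range(len(nodes) + 1):
        for subset in combinations(nodes, r):
            if any(frozenset(pair) in edges for pair in combinations(subset, 2)):
                continue
            w = sum(weights[v] for v in subset)
            if w > best_w:
                best, best_w = subset, w
    return best, best_w


def decode(independent_set):
    """Merge the partial assignments of the chosen nodes into one assignment."""
    y = {}
    for domain, y_alpha in independent_set:
        y.update(zip(domain, y_alpha))
    return y


# Example 2 with assumed parameter values.
theta1, theta2, theta12 = 1.0, -2.0, 0.5
factors = {
    (1,): lambda y: theta1 * y[0],
    (2,): lambda y: theta2 * y[0],
    (1, 2): lambda y: theta12 * y[0] * y[1],
}
nodes, weights, edges = build_conflict_graph(factors, c=3.0)
best_set, best_weight = brute_force_mwis(nodes, weights, edges)
y_map = decode(best_set)
```

Here the exponent $\theta_1 y_1 + \theta_2 y_2 + \theta_{12} y_1 y_2$ is largest at $(y_1, y_2) = (1, 0)$, and the decoded independent set selects exactly one node per domain, all consistent with that assignment.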



